Preface


Purpose 


Our objective is to provide a post-calculus introduction to the subject of probability
that
• Has mathematical integrity and contains some underlying theory
• Shows students a broad range of applications involving real problem scenarios
• Is current in its selection of topics
• Is accessible to a wide audience, including mathematics and statistics majors
  (yes, there are a few of the latter, and their numbers are growing), prospective
  engineers and scientists, and business and social science majors interested in the
  quantitative aspects of their disciplines
• Illustrates the importance of software for carrying out simulations when answers
  to questions cannot be obtained analytically

A number of currently available probability texts are heavily oriented toward a 
rigorous mathematical development of probability, with much emphasis on 
theorems, proofs, and derivations. Even when applied material is included, the 
scenarios are often contrived (many examples and exercises involving dice, coins, 
cards, and widgets). So in our exposition we have tried to achieve a balance 
between mathematical foundations and the application of probability to real-world
problems. It is our belief that the theory of probability by itself is often not
enough of a “hook” to get students interested in further work in the subject. 
We think that the best way to persuade students to continue their probabilistic 
education beyond a first course is to show them how the methodology is used in 
practice. Let’s first seduce them (figuratively speaking, of course) with intriguing 
problem scenarios and applications. Opportunities for exposure to mathematical 
rigor will follow in due course. 


Content 


The book begins with an Introduction, which contains our attempt to address the 
following question: “Why study probability?” Here we are trying to tantalize 
students with a number of intriguing problem scenarios—coupon collection, birth 
and death processes, reliability engineering, finance, queuing models, and various 
conundrums involving the misinterpretation of probabilistic information (e.g.,
Benford’s Law and the detection of fraudulent data, birthday problems, and the 
likelihood of having a rare disease when a diagnostic test result is positive). Most of 
the exposition contains references to recently published results. It is not necessary 
or even desirable to cover very much of this motivational material in the classroom. 
Instead, we suggest that instructors ask their students to read selectively outside 
class (a bit of pleasure reading at the very beginning of the term should not be an 
undue burden!). Subsequent chapters make little reference to the examples herein, 
and separating out our “pep talk” should make it easier to cover as little or much as 
an instructor deems appropriate. 

Chapter 1 covers sample spaces and events, the axioms of probability and
derived properties, counting, conditional probability, and independence. Discrete 
random variables and distributions are the subject of Chap. 2, and Chap. 3 
introduces continuous random variables and their distributions. Joint probability 
distributions are the focus of Chap. 4, including marginal and conditional 
distributions, expectation of a function of several variables, correlation, modes of 
convergence, the Central Limit Theorem, reliability of systems of components, the 
distribution of a linear combination, and some results on order statistics. These four 
chapters constitute the core of the book. 

The remaining chapters build on the core in various ways. Chapter 5 introduces 
methods of statistical inference—point estimation, the use of statistical intervals, 
and hypothesis testing. In Chap. 6 we cover basic properties of discrete-time 
Markov chains. Various other random processes and their properties, including 
stationarity and its consequences, Poisson processes, Brownian motion, and 
continuous-time Markov chains, are discussed in Chap. 7. The final chapter 
presents some elementary concepts and methods in the area of signal processing. 

One feature of our book that distinguishes it from the competition is a section at 
the end of almost every chapter that considers simulation methods for getting 
approximate answers when exact results are difficult or impossible to obtain. 
Both the R software and Matlab are employed for this purpose. 

Another noteworthy aspect of the book is the inclusion of roughly 1100 
exercises; the first four core chapters together have about 700 exercises. There 
are numerous exercises at the end of each section and also supplementary exercises 
at the end of every chapter. Probability at its heart is concerned with problem 
solving. A student cannot hope to really learn the material simply by sitting 
passively in the classroom and listening to the instructor. He/she must get actively 
involved in working problems. To this end, we have provided a wide spectrum of 
exercises, ranging from straightforward to reasonably challenging. It should be easy 
for an instructor to find enough problems at various levels of difficulty to keep 
students gainfully occupied. 


Mathematical Level


The challenge for students at this level should be to master the concepts and 
methods to a sufficient degree that problems encountered in the real world can be 
solved. Most of our exercises are of this type, and relatively few ask for proofs or 
derivations. Consequently, the mathematical prerequisites and demands are 
reasonably modest. Mathematical sophistication and quantitative reasoning ability 
are, of course, crucial to the enterprise. Univariate calculus is employed in the 
continuous distribution calculations of Chap. 3 as well as in obtaining maximum 
likelihood estimators in the inference chapter. But even here the functions we ask 
students to work with are straightforward—generally polynomials, exponentials, 
and logs. A stronger background is required for the signal processing material at the 
end of the book (we have included a brief mathematical appendix as a refresher for 
relevant properties). Multivariate calculus is used in the section on joint 
distributions in Chap. 4 and thereafter appears rather rarely. Exposure to matrix 
algebra is needed for the Markov chain material. 


Recommended Coverage 


Our book contains enough material for a year-long course. An instructor must be 
selective when using it in a course of shorter duration. To give a sense of what 
might be reasonable, we now describe two different courses at our home institution, 
California Polytechnic State University (in San Luis Obispo, CA), for which this 
book is appropriate. The university is on a quarter rather than a semester calendar. 
Depending on the quarter, there are between 38 and 41 class meetings (not 
including the final exam), each one lasting for 50 min. Courses are more commonly 
taught on the semester calendar, with three 50-min meetings or two 75-min 
meetings per week. This would allow for either more intensive work in the core 
chapters or a bit more time spent on selected material from Chaps. 5 to 8 than what 
we indicate below. 

The first course is Introduction to Probability Models. It has calculus and linear 
algebra prerequisites, but no prior exposure to probability or statistics is assumed. 
The course is taken by mathematics and statistics majors plus a handful of
engineering and quantitative finance students who are either in their sophomore or
junior year. Coverage of material is roughly as follows: 


Topic                                                              Number of class meetings
Introduction to Probability (Ch. 1)                                [illegible]
Discrete Random Variables and Distributions (Ch. 2)                7
Continuous Random Variables and Distributions (Ch. 3)              7
Joint Probability Distributions (Ch. 4)                            [illegible]
Markov Chains (Ch. 6)                                              5
Selected Material from Reliability/Random Processes (Chs. 4, 7)    3
Total                                                              36




Virtually all of the probability material in Chap. 1 is covered as well as much of
what appears in Chap. 2. General properties of continuous distributions, the normal 
distribution, and the exponential distribution are featured, whereas other continuous 
distributions and transformations receive little attention. In Chap. 4, joint
continuous distributions are covered very lightly and quickly; correlation, the Central Limit
Theorem, and linear combinations are emphasized, but properties of conditional 
expectation and transformations are just briefly mentioned and order statistics are 
off limits. The Markov chain chapter contains more material than can be covered in 
the course, and no inference or signal processing is included. 

The second course for which the book is intended is Probability and Random 
Processes for Engineers. The audience consists primarily of computer and
electrical engineering majors in their third year of study. Their mathematical background
is more extensive than that of at least some of the students in the probability models 
course, with exposure to differential equations and Fourier analysis. The following 
table describes coverage of material: 


Topic                                                              Number of class meetings
Introduction to Probability (Ch. 1)                                [illegible]
Discrete Random Variables and Distributions (Ch. 2)                [illegible]
Continuous Random Variables and Distributions (Ch. 3)              [illegible]
Joint Probability Distributions (Ch. 4)                            [illegible]
Selected Material from Random Processes (Ch. 7)                    [illegible]
Continuous-Time Signal Processing (Ch. 8)                          [illegible]
Total                                                              36


Once again instructors must be judicious in covering material from the core 
chapters in order to leave sufficient time for the more advanced material from 
Chaps. 7 and 8 (this syllabus was developed jointly with the relevant engineering 
departments). 

We are able to cover as much material as indicated on the foregoing syllabi with 
the aid of a not-so-secret weapon: we prepare and require that students bring to class 
a course booklet. The booklet contains most of the examples we present as well as 
some surrounding material. A typical example begins with a problem statement and 
then poses several questions (as in the exercises in this book). After each posed 
question there is some blank space so that the student can either take notes as the 
solution is developed in class or else work the problem on his/her own if asked to do 
so. Because students have a booklet, the instructor does not have to write as much 
on the board as would otherwise be necessary and the student does not have to do as 
much writing to take notes. Both the instructor and the students benefit. 

We also like to think that students can be asked to read an occasional subsection 
or even section on their own and then work exercises to demonstrate understanding, 
so that not everything needs to be presented in class. For example, we have found 
that assigning a take-home exam problem that requires reading about the Weibull 
and/or lognormal distributions is a good way to acquaint students with them. But 
instructors should always keep in mind that there is never enough time in a course 
of any duration to teach students all that we’d like them to know. Hopefully
students will like the book enough to keep it after the course is over and use it as 
a basis for extending their knowledge of probability! 


A Final Thought


It is our hope that students completing a course taught from this book will feel as 
passionately about the subject of probability as we still do after so many years of 
living with it. Only teachers can really appreciate how gratifying it is to hear from a 
student after he/she has completed a course that the experience had a positive 
impact and maybe even affected a career choice. 


San Luis Obispo, CA Matthew A. Carlton 
San Luis Obispo, CA Jay L. Devore 


Introduction: Why Study Probability?


Some of you may enjoy mathematics for its own sake—it is a beautiful subject 
which provides many wonderful intellectual challenges. Of course students of 
philosophy would say the same thing about their discipline, ditto for students of 
linguistics, and so on. However, many of us are not satisfied just with aesthetics and 
mental gymnastics. We want what we’re studying to have some utility, some 
applicability to real-world problems. Fortunately, mathematics in general and 
probability in particular provide a plethora of tools for answering important
professional and societal questions. In this section, we’ll attempt to provide some
preliminary motivation before forging ahead.

The initial development of probability as a branch of mathematics goes back 
over 300 years, where it had its genesis in connection with questions involving 
games of chance. One of the earliest recorded instances of probability calculation 
appeared in correspondence between the two very famous mathematicians, Blaise 
Pascal and Pierre de Fermat. The issue was which of the following two outcomes of 
die-tossing was more favorable to a bettor: (1) getting at least one 6 in four rolls of a 
fair die (“fair” here means that each of the six outcomes 1, 2, 3, 4, 5, and 6 is equally
likely to occur) or (2) getting at least one pair of 6s when two fair dice are rolled 
24 times in succession. By the end of this chapter, you shouldn’t have any difficulty 
showing that there is a slightly better than 50-50 chance of (1) occurring, whereas 
the odds are slightly against (2) occurring. 
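
Both probabilities in the Pascal and Fermat question can be checked with the complement rule: the chance of at least one success in a run of independent trials is one minus the chance of no successes. The following is a minimal sketch of that arithmetic. Python is our choice for illustration only (the book's own code is in Matlab and R), and the function name at_least_one is hypothetical.

```python
from fractions import Fraction

def at_least_one(p_success, trials):
    """P(at least one success in independent trials) = 1 - (1 - p)^trials."""
    return 1 - (1 - p_success) ** trials

# (1) at least one 6 in four rolls of a fair die
p_one_six = at_least_one(Fraction(1, 6), 4)
# (2) at least one pair of 6s in 24 rolls of two fair dice
p_double_six = at_least_one(Fraction(1, 36), 24)

print(float(p_one_six))     # about 0.5177
print(float(p_double_six))  # about 0.4914
```

Exact fractions are used so that the comparison with 1/2 does not depend on floating-point rounding.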

Games of chance have continued to be a fruitful area for the application of 
probability methodology. Savvy poker players certainly need to know the odds of 
being dealt various hands, such as a full house or straight (such knowledge is 
necessary but not at all sufficient for achieving success in card games, as such 
endeavors also involve much psychology). The same holds true for the game of 
blackjack. In fact, in 1962 the mathematics professor Edward O. Thorp published 
the book Beat the Dealer; in it he employed probability arguments to show that as 
cards were dealt sequentially from a deck, there were situations in which the 
likelihood of success favored the player rather than the dealer. Because of this 
work, casinos changed the way cards were dealt in order to prevent card-counting 
strategies from bankrupting them. A recent variant of this is described in the paper 
“Card Counting in Continuous Time” (Journal of Applied Probability, 2012: 
184-198), in which the number of decks utilized is large enough to justify the use 
of a continuous approximation to find an optimal betting strategy. 

In the last few decades, game theory has developed as a significant branch of
mathematics devoted to the modeling of competition, cooperation, and conflict. 
Much of this work involves the use of probability properties, with applications in 
such diverse fields as economics, political science, and biology. However,
especially over the course of the last 60 years, the scope of probability applications has
expanded way beyond gambling and games. In this section, we present some 
contemporary examples of how probability is being used to solve important 
problems. 


Software Use in Probability 


Modern probability applications often require the use of a calculator or software. Of 
course, we rely on machines to perform every conceivable computation from 
adding numbers to evaluating definite integrals. Many calculators and most
computer software packages even have built-in functions that make a number of specific
probability calculations more convenient; we will highlight these throughout the 
text. But the real utility of modern software comes from its ability to simulate 
random phenomena, which proves invaluable in the analysis of very complicated 
probability models. We will introduce the key elements of probability simulation in 
Sect. 1.7 and then revisit simulation in a variety of settings throughout the book. 

Numerous software packages can be used to implement a simulation. We will 
focus on two: Matlab and R. Matlab is a powerful engineering software package 
published by MathWorks; many universities and technology companies have a 
license for Matlab. A freeware package called Octave has been designed to
implement the majority of Matlab functions using identical syntax; consult
http://www.gnu.org/software/octave/. (Readers using Mac OS or Windows rather than
GNU/Linux will find links to compatible versions of Octave on this same website.)
R is a freeware statistical software package maintained by a core user group. The R 
base package and numerous add-ons are available at http://cran.r-project.org/. 

Throughout this textbook, we will provide side by side Matlab and R code for 
both probability computations and simulation. It is not the goal, however, to serve 
as a primer in either language (certainly, some prior knowledge of elementary 
programming is required). Both software packages have extensive help menus and 
active online user support groups. Readers interested in a more thorough treatment 
of these software packages should consult Matlab Primer by Timothy A. Davis or 
The R Book by Michael J. Crawley. 


Modern Application of Classic Probability Problems 


The coupon collector problem has been well known for decades in the probability 
community. As an example, suppose each box of a certain type of cereal contains a 
small toy. The manufacturer of this cereal has included a total of ten toys in its 
cereal boxes, with each box being equally likely to yield one of the ten toys. 
Suppose you want to obtain a complete set of these toys for a young relative or
friend. Clearly you will have to purchase at least ten boxes, and intuitively it would 
seem as though you might have to purchase many more than that. How many boxes 
would you expect to have to purchase in order to achieve your goal? Methods from 
Chap. 4 can be used to show that the average number of boxes required is
10(1 + 1/2 + 1/3 + ... + 1/10). If instead there are n toys, then n replaces 10 in this
expression. And when n is large, more sophisticated mathematical arguments yield the
approximation n(ln(n) + .58).
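
The exact mean, the large-n approximation, and a simulated estimate can be compared numerically. The sketch below is written in Python purely for illustration (the book itself uses Matlab and R); the function names, the random seed, and the number of simulation runs are our assumptions, not part of the text.

```python
import math
import random

def expected_boxes(n):
    """Exact mean number of boxes: n * (1 + 1/2 + ... + 1/n)."""
    return n * sum(1 / k for k in range(1, n + 1))

def approx_boxes(n):
    """Large-n approximation quoted in the text: n * (ln(n) + .58)."""
    return n * (math.log(n) + 0.58)

def simulate_boxes(n, rng):
    """Buy boxes until all n equally likely toys have been seen."""
    seen, boxes = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # toy found in this box
        boxes += 1
    return boxes

rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable
runs = 20000
sim_mean = sum(simulate_boxes(10, rng) for _ in range(runs)) / runs
print(expected_boxes(10), approx_boxes(10), sim_mean)
```

For ten toys the exact mean is a little over 29 boxes, and the simulated average should land close to it. The approximation is intended for large n, so it is somewhat less accurate at n = 10.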

The article “A Generalized Coupon Collector Problem” (Journal of Applied 
Probability, 2011: 1081-1094) mentions applications of the classic problem to 
dynamic resource allocation, hashing in computer science, and the analysis of 
delays in certain wireless communication channels (in this latter application, 
there are n users, each receiving packets of data from a transmitter). The
generalization considered in the article involves each cereal box containing d different toys
with the purchaser then selecting the least collected toy thus far. The expected 
number of purchases to obtain a complete collection is again investigated, with 
special attention to the case of n being quite large. An application to the wireless 
communication scenario is mentioned. 


Applications to Business 


The article “Newsvendor-Type Models with Decision-Dependent Uncertainty” 
(Mathematical Methods of Operations Research, 2012, published online) begins 
with an overview of a class of decision problems involving uncertainty. In the 
classical newsvendor problem, a seller has to choose the amount of inventory to 
obtain at the beginning of a selling season. This ordering decision is made only 
once, with no opportunity to replenish inventory during the season. The amount of 
demand D is uncertain (what we will call in Chap. 2 a random variable). The cost of 
obtaining inventory is c per unit ordered, the sale price is r per unit, and any unsold 
inventory at the end of the season has a salvage value of v per unit. The optimal 
policy, that which maximizes expected profit, is easily characterized in terms of the 
probability distribution of D (this distribution specifies how likely it is that various 
values of D will occur). 
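
To make the characterization concrete: a standard result for this model (stated here as background, not taken from the cited article) is that when v < c < r, expected profit is maximized by ordering the smallest quantity q for which P(D ≤ q) is at least (r − c)/(r − v). The Python sketch below checks this rule against a direct search over order quantities. The demand distribution and the prices are hypothetical values chosen only for illustration, and Python stands in for the Matlab and R used in the book.

```python
def expected_profit(q, demand_pmf, c, r, v):
    """Expected profit when q units are ordered.

    demand_pmf maps each demand value d to P(D = d).
    Sold units earn r, unsold units are salvaged at v, every unit costs c.
    """
    total = 0.0
    for d, p in demand_pmf.items():
        sold = min(q, d)
        total += p * (r * sold + v * (q - sold) - c * q)
    return total

def critical_ratio_order(demand_pmf, c, r, v):
    """Smallest q with P(D <= q) >= (r - c) / (r - v)."""
    ratio = (r - c) / (r - v)
    cumulative = 0.0
    for d in sorted(demand_pmf):
        cumulative += demand_pmf[d]
        if cumulative >= ratio - 1e-12:  # small tolerance for rounding
            return d
    return max(demand_pmf)

# Hypothetical demand distribution and prices, chosen only for illustration.
demand_pmf = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.25, 4: 0.15}
c, r, v = 4.0, 10.0, 1.0

best_by_search = max(demand_pmf, key=lambda q: expected_profit(q, demand_pmf, c, r, v))
best_by_ratio = critical_ratio_order(demand_pmf, c, r, v)
print(best_by_search, best_by_ratio)
```

With these illustrative numbers the two approaches agree, which is the sense in which the optimal policy depends only on the distribution of D and the three prices.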

In the revenue management problem, there are S units of inventory to sell. Each 
unit is sold for a price of either r1 or r2 (r1 > r2). During the first phase of the selling
season, customers arrive who will buy at the price r2 but not at r1. In the second
phase, customers arrive who will pay the higher price. The seller wishes to know 
how much of the initial inventory should be held in reserve for the second phase. 
Again the general form of the optimal policy that maximizes expected profit is 
easily determined in terms of the distributions for demands in the two periods. The 
article cited in the previous paragraph goes on to consider situations in which the 
distribution(s) of demand(s) must be estimated from data and how such estimation 
affects decision making. 

A cornerstone of probabilistic inventory modeling is a general result established
more than 50 years ago: Suppose that the amount of inventory of a commodity is 
reviewed every T time periods to decide whether more should be ordered. Under 
rather general conditions, it was shown that the optimal policy—the policy that 
minimizes the long-run expected cost—is to order nothing if the current level of 
inventory is at least an amount s but to order enough to bring the inventory level up 
to an amount S if the current level is below s. The values of s and S are determined 
by various costs, the price of the commodity, and the nature of demand for the 
commodity (how customer orders and order amounts occur over time). 
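
The ordering rule itself is simple to state in code, and a short simulation shows how the long-run average cost depends on the choice of s and S. This is a rough Python sketch under assumptions of our own, not the model from the original result: inventory is reviewed every period (T = 1), delivery is instant, demand is uniform on 0 to 10, unmet demand is lost, and all cost figures are hypothetical.

```python
import random

def order_quantity(level, s, S):
    """(s, S) rule: order nothing if level >= s, else order up to S."""
    return 0 if level >= s else S - level

def simulate_average_cost(s, S, periods, rng,
                          fixed_cost=20.0, unit_cost=2.0,
                          holding_cost=1.0, shortage_cost=8.0):
    """Average per-period cost under an (s, S) policy with instant delivery.

    Demand each period is uniform on 0..10 and unmet demand is lost;
    both are simplifying assumptions made for this sketch.
    """
    level, total = S, 0.0
    for _ in range(periods):
        q = order_quantity(level, s, S)
        if q > 0:
            total += fixed_cost + unit_cost * q
            level += q
        demand = rng.randint(0, 10)
        sold = min(level, demand)
        total += shortage_cost * (demand - sold)  # penalty for lost sales
        level -= sold
        total += holding_cost * level             # cost of carrying stock
    return total / periods

rng = random.Random(0)
for s, S in [(2, 10), (5, 15), (8, 25)]:
    print(s, S, round(simulate_average_cost(s, S, 50000, rng), 2))
```

Comparing the printed averages for a few (s, S) pairs gives a feel for the trade-off among ordering, holding, and shortage costs that the optimal values balance.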

The article “A Periodic-Review Base-Stock Inventory System with Sales
Rejection” (Operations Research, 2011: 742-753) considers a policy appropriate when
backorders are possible and lost sales may occur. In particular, an order is placed
every T time periods to bring inventory up to some level S. Demand for the
commodity is filled until the inventory level reaches a sales rejection threshold
M for some M < S. Various properties of the optimal values of M and S are
investigated. 


Applications to the Life Sciences 


Examples of the use of probability and probabilistic modeling can be found in many 
subdisciplines of the life sciences. For example, Pseudomonas syringae is a
bacterium which lives in leaf surfaces. The article “Stochastic Modeling of Pseudomonas
Syringae Growth in the Phyllosphere” (Mathematical Biosciences, 2012: 106-116) 
proposed a probabilistic (synonymous with “stochastic”) model called a birth and 
death process with migration to describe the aggregate distribution of such bacteria 
and determine the mechanisms which generated experimental data. The topic of 
birth and death processes is considered briefly in Chap. 7 of our book. 

Another example of such modeling appears in the article “Means and Variances 
in Stochastic Multistage Cancer Models” (Journal of Applied Probability, 2012: 
590-594). The authors discuss a widely used model of carcinogenesis in which 
division of a healthy cell may give rise to a healthy cell and a mutant cell, whereas 
division of a mutant cell may result in two mutant cells of the same type or possibly 
one of the same type and one with a further mutation. The objective is to obtain an
expression for the expected number of cells at each stage and also a quantitative 
assessment of how much the actual number might deviate from what is expected 
(that is what “variance” does). 

Epidemiology is the branch of medicine and public health that studies the causes 
and spread of various diseases. Of particular interest to epidemiologists is how 
epidemics are propagated in one or more populations. The general stochastic 
epidemic model assumes that a newly infected individual is infectious for a random 
amount of time having an exponential distribution (this distribution is discussed in 
Chap. 3) and during this infectious period encounters other individuals at times 
determined by a Poisson process (one of the topics in Chap. 7). The article “The 
Basic Reproduction Number and the Probability of Extinction for a Dynamic 
Epidemic Model” (Mathematical Biosciences, 2012: 31-35) considers an extension
in which the population of interest consists of a fixed number of subpopulations. 
Individuals move between these subpopulations according to a Markov transition 
matrix (the subject of Chap. 6) and infectives can only make infectious contact with 
members of their current subpopulation. The effect of variation in the infectious 
period on the probability that the epidemic ultimately dies out is investigated. 

Another approach to the spread of epidemics is based on branching processes. 
In the simplest such process, a single individual gives birth to a random number of 
individuals; each of these in turn gives birth to a random number of progeny, and so 
on. The article “The Probability of Containment for Multitype Branching Process 
Models for Emerging Epidemics” (Journal of Applied Probability, 2011: 173-188) 
uses a model in which each individual “born” to an existing individual can have one 
of a finite number of severity levels of the disease. The resulting theory is applied to 
construct a simulation model of how influenza spread in rural Thailand. 


Applications to Engineering and Operations Research 


We want products that we purchase and systems that we rely on (e.g.,
communication networks, electric power grids) to be highly reliable, that is, to have long lifetimes and
work properly during those lifetimes. Product manufacturers and system designers 
therefore need to have testing methods that will assess various aspects of reliability. 
In the best of all possible worlds, data bearing on reliability could be obtained under 
normal operating conditions. However, this may be very time consuming when 
investigating components and products that have very long lifetimes. For this 
reason, there has been much research on “accelerated” testing methods which 
induce failure or degradation in a much shorter time frame. For products that are 
used only a fraction of the time in a typical day, such as home appliances and 
automobile tires, acceleration might entail operating continuously in time but under 
otherwise normal conditions. Alternatively, a sample of units could be subjected to 
stresses (e.g., temperature, vibration, voltage) substantially more severe than what 
is usually experienced. Acceleration can also be applied to entities in which 
degradation occurs over time: stiffness of springs, corrosion of metals, and
wearing of mechanical components. In all these cases, probability models must then be
developed to relate lifetime behavior under such acceleration to behavior in more 
customary situations. The article “Overview of Reliability Testing” (IEEE
Transactions on Reliability, 2012: 282-291) gives a survey of various testing 
methodologies and models. The article “A Methodology for Accelerated Testing 
by Mechanical Actuation of MEMS Devices” (Microelectronics Reliability, 2012: 
1382-1388) applies some of these ideas in the context of predicting lifetimes for 
micro-electro-mechanical systems. 

An important part of modern reliability engineering deals with building
redundancy into various systems in order to decrease substantially the likelihood of
failure. A k-out-of-n:G system works or is good only if at least k amongst the 
n constituent components work or are good, whereas a k-out-of-n:F system fails if 
and only if at least k of the n components fail. The article “Redundancy Issues in
Software and Hardware Systems: An Overview” (Intl. Journal of Reliability,
Quality, and Safety Engineering, 2011: 61-98) surveys these and various other 
systems that can improve the performance of computer software and hardware. The 
so-called triple modular redundant systems, with 2-out-of-3:G configuration, are 
now commonplace (e.g., Hewlett-Packard’s original NonStop server, and a variety 
of aero, auto, and rail systems). The article “Reliability of Various 2-Out-of-4:G 
Redundant Systems with Minimal Repair” (IEEE Transactions on Reliability,
2012: 170-179) considers using a Poisson process with time-varying rate function 
to model how component failures occur over time so that the rate of failure 
increases as a component ages; in addition, a component that fails undergoes repair 
so that it can be placed back in service. Several failure modes for combined
k-out-of-n systems are studied in the article “Reliability of Combined
m-Consecutive-k-out-of-n:F and Consecutive-kc-out-of-n:F Systems” (IEEE Transactions on
Reliability, 2012: 215-219); these have applications in the areas of infrared detecting
and signal processing. 
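
The k-out-of-n definitions above translate directly into a binomial calculation in the simplest case. The Python sketch below assumes that the n components work independently, each with the same probability p. That assumption is ours, made for illustration, and is much simpler than the repairable and time-varying models in the cited articles; the function names are hypothetical.

```python
from math import comb

def k_out_of_n_good(k, n, p):
    """Reliability of a k-out-of-n:G system.

    Assumes the n components work independently, each with probability p,
    so the number of working components is binomial(n, p).
    """
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def k_out_of_n_fail(k, n, p):
    """Reliability of a k-out-of-n:F system: fewer than k components fail."""
    q = 1 - p  # probability that a single component fails
    return sum(comb(n, j) * q**j * p**(n - j) for j in range(0, k))

p = 0.9  # hypothetical component reliability
print(k_out_of_n_good(2, 3, p))  # triple modular redundancy: 3p^2 - 2p^3
print(k_out_of_n_good(2, 4, p))
```

Under these assumptions a k-out-of-n:F system has the same reliability as an (n − k + 1)-out-of-n:G system, since fewer than k failures means at least n − k + 1 components still work.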

A compelling reason for manufacturers to be interested in reliability information 
about their products is that they can establish warranty policies and periods that 
help control costs. Many warranties are “one dimensional,” typically characterized 
by an interval of age (time). However, some warranties are “two dimensional” in 
that warranty conditions depend on both age and cumulative usage; these are 
common in the automotive industry. The article “Effect of Use-Rate on System 
Lifetime and Failure Models for 2D Warranty” (Intl. Journal of Quality and
Reliability Management, 2011: 464-482) describes how certain bivariate
probability models for jointly describing the behavior of time and usage can be used to
investigate the reliability of various system configurations. 

The word queue is used chiefly by the British to mean “waiting line,” i.e., a line 
of customers or other entities waiting to be served or brought into service. The 
mathematical development of models for how a waiting line expands and contracts 
as customers arrive at a service facility, enter service, and then finish began in 
earnest in the middle part of the 1900s and continues unabated today as new 
application scenarios are encountered. 

For example, the arrival and service of patients at some type of medical unit are 
often described by the notation M/M/s, where the first M signifies that arrivals occur 
according to a Poisson process, the second M indicates that the service time of each 
patient is governed by an exponential probability distribution, and there are 
s servers available for the patients. The article “Nurse Staffing in Medical Units: 
A Queueing Perspective” (Operations Research, 2011: 1320-1331) proposes an 
alternative closed queueing model in which there are s nurses within a single 
medical unit servicing n patients, where each patient alternates between requiring 
assistance and not needing assistance. The performance of the unit is characterized 
by the likelihood that delay in serving a patient needing assistance will exceed some 
critical threshold. A staffing rule based on the model and assumptions is developed; 
the resulting rule differs significantly from the fixed nurse-to-patient staffing ratios 
mandated by the state of California. 
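
For the basic M/M/s model described at the start of this discussion, the chance that an arriving patient must wait can be computed from the classical Erlang C formula. The formula is a standard queueing result rather than something stated in this text or in the cited article, and it applies to the open M/M/s queue, not to the closed model the article proposes. The sketch below is in Python for illustration; the arrival and service rates are hypothetical.

```python
from math import factorial

def prob_wait(arrival_rate, service_rate, servers):
    """Erlang C formula: P(an arriving customer must wait) in an M/M/s queue."""
    a = arrival_rate / service_rate          # offered load
    rho = a / servers                        # utilization per server
    if rho >= 1:
        raise ValueError("queue is unstable: need arrival_rate < servers * service_rate")
    tail = a**servers / factorial(servers) / (1 - rho)
    head = sum(a**k / factorial(k) for k in range(servers))
    return tail / (head + tail)

def mean_wait(arrival_rate, service_rate, servers):
    """Mean time spent waiting in line before service begins."""
    return prob_wait(arrival_rate, service_rate, servers) / (
        servers * service_rate - arrival_rate)

# Hypothetical unit: 4 requests per hour, each server handles 1 per hour.
for s in range(5, 9):
    print(s, round(prob_wait(4.0, 1.0, s), 4), round(mean_wait(4.0, 1.0, s), 4))
```

Running the loop shows how quickly the waiting probability falls as servers are added, which is the kind of trade-off a staffing rule has to weigh.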

A variation on the medical unit situation just described occurs in the context of
call centers, where effective management entails a trade-off between operational 
costs and the quality of service offered to customers. The article “Staffing Call 
Centers with Impatient Customers” (Operations Research, 2012: 461-474) 
considers an M/M/s queue in which customers who have to wait for service may 
become frustrated and abandon the facility (don’t you sometimes feel like doing 
that in a doctor’s office?). The behavior of such a system when n is large is 
investigated, with particular attention to the staffing principle that relates the 
number of servers to the square root of the workload offered to the call center. 
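
To make the M/M/s notation and the square-root staffing idea a bit more concrete, here is a rough Python sketch. It uses the classical Erlang C formula for an M/M/s queue in which no customer abandons, which is a simpler model than the one studied in the article. The function names `erlang_c` and `square_root_staffing` and the safety factor `beta` are illustrative choices of our own, not taken from the article.

```python
from math import ceil, factorial, sqrt


def erlang_c(s, a):
    """Probability that an arriving customer must wait in an M/M/s queue.

    s: number of servers; a: offered load (arrival rate / service rate).
    Requires a < s so that the queue is stable. No abandonment is modeled.
    """
    if a >= s:
        raise ValueError("offered load must be smaller than the number of servers")
    rho = a / s  # server utilization
    waiting_term = a**s / factorial(s) / (1 - rho)
    no_wait_terms = sum(a**k / factorial(k) for k in range(s))
    return waiting_term / (no_wait_terms + waiting_term)


def square_root_staffing(a, beta=1.0):
    """Hypothetical helper: staff the offered load plus beta times its square root."""
    return ceil(a + beta * sqrt(a))


# Delay probability under square-root staffing for a few offered loads.
for load in (10, 50, 100):
    servers = square_root_staffing(load)
    print(load, servers, round(erlang_c(servers, load), 3))
```

Under these assumptions, the printed delay probabilities can be compared across loads to see how staffing "load plus a multiple of its square root" behaves as the system grows.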

The methodology of queueing can also be applied to find optimal settings for 
traffic signals. The article “Delays at Signalized Intersections with Exhaustive 
Traffic Control” (Probability in Engineering and Informational Sciences, 2012: 
337-373) utilizes a “polling model,” which entails multiple queues of customers 
(corresponding to different traffic flows) served by a single server in cyclic order. 
The proposed vehicle-actuated rule is that traffic lights stay green until all lanes 
within a group are emptied. The mean traffic delay is studied for a variety of vehicle 
interarrival-time distributions in both light-traffic and heavy-traffic situations. 

Suppose two different types of customers, primary and secondary, arrive for 
service at a facility where the servers have different service rates. How should 
customers be assigned to the servers? The article “Managing Queues with Hetero- 
geneous Servers” (Journal of Applied Probability, 2011: 435-452) shows that the 
optimal policy for minimizing mean wait time has a “threshold structure”: for each 
server, there is a different threshold such that a primary customer will be assigned to 
that server if and only if the queue length of primary customers meets or exceeds the 
threshold. 


Applications to Finance 


The most explosive growth in the use of probability theory and methodology over 
the course of the last several decades has undoubtedly been in the area of finance. 
This has provided wonderful career opportunities for people with advanced degrees 
in statistics, mathematics, engineering, and physics (the son-in-law of one of the 
authors earned a Ph.D. in mechanical engineering and taught for several years, but 
then switched to finance). Edward O. Thorp, whom we previously met as the man 
who figured out how to beat blackjack, subsequently went on to success in finance, 
where he earned much more money managing hedge funds and giving advice than 
he could ever have hoped to earn in academia (those of us in academia love it for the 
intangible rewards we get—psychic income, if you will). 

One of the central results in mathematical finance is the Black-Scholes theorem, 
named after the two Nobel-prize-winning economists who discovered it. To get the 
flavor of what is involved here, a bit of background is needed. Suppose the present 
price of a stock is $20 per share, and it is known that at the end of 1 year, the price 
will either double to $40 or decrease to $10 per share (where those prices are 
expressed in current dollars, i.e., taking account of inflation over the 1-year period). 
You can enter into an agreement, called an option contract, that allows you to 
purchase y shares of this stock (for any value y) 1 year from now for the amount cy 
(again in current dollars). In addition, right now you can buy x shares of the stock 
for 20x with the objective of possibly selling those shares 1 year from now. The 
values x and y are both allowed to be negative; if, for example, x were negative, then 
you would actually be selling shares of the stock now that you would have to 
purchase at either a cost of $40 per share or $10 per share 1 year from now. It can 
then be shown that there is only one value of c, specifically 50/3, for which the gain 
from this investment activity is 0 regardless of the choices of x and y and the value 
of the stock 1 year from now. If c is anything other than 50/3, then there is an 
arbitrage, an investment strategy involving choices of x and y that is guaranteed to 
result in a positive gain. 

A general result called the Arbitrage Theorem specifies conditions under which a 
collection of investments (or bets) has expected return 0 as opposed to there being 
an arbitrage strategy. The basis for the Black-Scholes theorem is that the variation in 
the price of an asset over time is described by a stochastic process called geometric 
Brownian motion (see Sect. 7.6). The theorem then specifies a fair price for an 
option contract on that asset so that no arbitrage is possible. 

Modern quantitative finance is very complex, and many of the basic ideas 
are unfamiliar to most novices (like the authors of this text!). It is therefore difficult 
to summarize the content of recently published articles as we have done for 
some other application areas. But a sampling of recently published titles 
emphasizes the role of probability modeling. Articles that appeared in the 2012 
Annals of Finance included “Option Pricing Under a Stressed Beta Model” and 
“Stochastic Volatility and Stochastic Leverage”; in the 2012 Applied Mathematical 
Finance, we found “Determination of Probability Distribution Measures from 
Market Prices Using the Method of Maximum Entropy in the Mean” and “On 
Cross-Currency Models with Stochastic Volatility and Correlated Interest Rates”; 
the 2012 Quantitative Finance yielded “Probability Unbiased Value-at-Risk 
Estimators” and “A Generalized Birth-Death Stochastic Model for High-Frequency 
Order Book Dynamics.” 

If the application of mathematics to problems in finance is of interest to 
you, there are now many excellent masters-level graduate programs in quantitative 
finance. Entrance to these programs typically requires a very solid background 
in undergraduate mathematics and statistics (including especially the course 
for which you are using this book). Be forewarned, though, that not all financially 
savvy individuals are impressed with the direction in which finance has recently 
moved. Former Federal Reserve Chairman Paul Volcker was quoted not long ago 
as saying that the ATM cash machine was the most significant financial innovation 
of the last 20 years; he has been a very vocal critic of the razzle-dazzle of modern 
finance. 


Probability in Everyday Life 


In the hopefully unlikely event that you do not end up using probability concepts 
and methods in your professional life, you still need to face the fact that ideas 
surrounding uncertainty are pervasive in our world. We now present some amusing 
and intriguing examples to illustrate this. 

The behavioral psychologists Amos Tversky and Daniel Kahneman spent much 
of their academic careers carrying out studies to demonstrate that human beings 
frequently make logical errors when processing information about uncertainty 
(Kahneman won a Nobel prize in economics for his work, and Tversky would 
surely have also done so had the awards been given posthumously). Consider the 
following variant of one Tversky-Kahneman scenario. Which of the following two 
statements is more likely? 

(A) Dr. D is a former professor. 

(B) Dr. D is a former professor who was accused of inappropriate relations with 
some students, investigation substantiated the charges, and he was stripped of 
tenure. 

T-K’s research indicated that many people would regard statement B as being 
more likely, since it gives a more detailed explanation of why Dr. D is no longer a 
professor. However, this is incorrect. Statement B implies statement A. One of our 
basic probability rules will be that if one event B is contained in another event 
A (i.e., if B implies A), then the smaller event B is less likely to occur or have 
occurred than the larger event A. After all, other possible explanations for A are that 
Dr. D is deceased or that he is retired or that he deserted academia for investment 
banking—all of those plus B would figure in to the likelihood of A. 

The survey article “Judgment under Uncertainty: Heuristics and Biases” 
(Science, 1974: 1124-1131) by T-K described a certain town served by two 
hospitals. In the larger hospital about 45 babies are born each day, whereas about 
15 are born each day in the smaller one. About 50% of births are boys, but of course 
the percentage fluctuates from day to day. For a 1-year period, each hospital 
recorded days on which more than 60% of babies born were boys. Each of a number 
of individuals was then asked which of the following statements he/she thought was 
correct: (1) the larger hospital recorded more such days, (2) the smaller hospital 
recorded more such days, or (3) the number of such days was about the same for the 
two hospitals. Of the 95 participants, 21 chose (1), 21 chose (2), and 53 chose (3). In 
Chap. 5 we present a general result which implies that the correct answer is in fact 
(2), because the sample percentage is less likely to stray from the true percentage 
(in this case about 50%) when the sample size is large rather than small. 
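
The hospital comparison can be checked numerically with a short Python sketch. It assumes, as a simplification of the scenario, that exactly 15 or 45 babies are born on a given day, that births are independent, and that each is a boy with probability .5; the function name is our own.

```python
from math import comb


def prob_more_than_60_percent_boys(n_births, p_boy=0.5):
    """P(more than 60% of n_births are boys) under a binomial model."""
    return sum(
        comb(n_births, k) * p_boy**k * (1 - p_boy) ** (n_births - k)
        for k in range(n_births + 1)
        if 5 * k > 3 * n_births  # integer form of k / n_births > 0.6
    )


p_small = prob_more_than_60_percent_boys(15)  # smaller hospital
p_large = prob_more_than_60_percent_boys(45)  # larger hospital
print(round(p_small, 4), round(p_large, 4))
```

Under these assumptions the daily chance of exceeding 60% boys is noticeably higher for the smaller hospital, which is why it should record more such days over a year.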

In case you think that mistakes of this sort are made only by those who are 
unsophisticated or uneducated, here is yet another T-K scenario. Each of a sample 
of 80 physicians was presented with the following information on treatment for a 
particular disease: 


With surgery, 10% will die during treatment, 32% will die within a year, 66% will die 
within 5 years. With radiation, 0% will die during treatment, 23% will die within a year, 
78% will die within 5 years. 


Each of the 87 physicians in a second sample was presented with the following 
information: 


With surgery, 90% will survive the treatment, 68% will survive at least 1 year, and 34% will 
survive at least 5 years. With radiation, 100% will survive the treatment, 77% will survive 
at least 1 year, and 22% will survive at least 5 years. 


When each physician was asked to indicate whether he/she would recommend 
surgery or radiation based on the supplied information, 50% of those in the first 
group said surgery whereas 84% of those in the second group said surgery. 

The distressing thing about this conclusion is that the information provided to 
the first group of physicians is identical to that provided to the second group, but 
described in a slightly different way. If the physicians were really processing 
information rationally, there should be no significant difference between the two 
percentages. 

It would be hard to find a book containing even a brief exposition of probability 
that did not contain examples or exercises involving coin tossing. Many such 
scenarios involve tossing a “fair” coin, one that is equally likely to result in 
H (head side up) or T (tail side up) on any particular toss. Are real coins actually 
fair, or is there a bias of some sort? Various analyses have shown that the result of a 
coin toss is predictable at least to some degree if initial conditions (position, 
velocity, angular momentum) are known. In practice, most people who toss coins 
(e.g., referees in a football game trying to determine which team will kick off and 
which will receive) are not conversant in the physics of coin tossing. The mathe- 
matician and statistician Persi Diaconis, who was a professional magician for 
10 years prior to earning his Ph.D. and mastered many coin and card tricks, has 
engaged in ongoing collaboration with other researchers to study coin tossing. One 
result of these investigations was the conclusion based on physics that for a caught 
coin, there is a slight bias toward heads—about .51 versus .49. It is not, however, 
clear under which real-world circumstances this or some other bias will occur. 

Simulation of fair-coin tossing can be done using a random number generator 
available in many software packages (about which we’ll say more shortly). If the 
resulting random number is between 0 and .5, we say that the outcome of the toss 
was H, and if the number is between .5 and 1, then a T occurred (there is an obvious 
modification of this to incorporate bias). Now consider the following sequence of 
200 Hs and Ts: 


THTHTTTHTTTTTHTHTTTHTTHHHTHHTHTHTHTTTTHHTTHHTTHHHT 
HHHTTHHHTTTHHHTHHHHTTTHTHTHHHHTHTTTHHHTHHTHTTTHHTH 
HHTHHHHTTHTHHTHHHTTTHTHHHTHHTTTHHHTTTTHHHTHTHHHHTH 
TTHHTTTTHTHTHTTHTHHTTHTTTHTTTTHHHHTHTHHHTTHHHHTTHH 


Did this sequence result from actually tossing a fair coin (equivalently, using 
computer simulation as described), or did it come from someone who was asked to 
write down a sequence of 200 Hs and Ts that he/she thought would come from 
tossing a fair coin? One way to address this question is to focus on the longest run of 
Hs in the sequence of tosses. This run is of length 4 for the foregoing sequence. 
Probability theory tells us that the expected length of the longest run in a sequence 
of n fair-coin tosses is approximately $\log_2(n) - 2/3$. For n = 200, this formula gives 
an expected longest run of length about 7. It can also be shown that there is less than 
a 10% chance that the longest run will have a length of 4 or less. This suggests that 
the given sequence is fictitious rather than real, as in fact was the case; see the very 
nice expository article “The Longest Run of Heads” (Mathematics Magazine, 1990, 
196-207). 
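
The longest-run check described above can be sketched in Python. The sequence below is copied from the text, with one character in the fourth row treated as an assumed repair (taken to be T); the simulation settings (2000 trials, a fixed seed) and the function names are our own illustrative choices.

```python
import random
from math import log2

# The 200-toss sequence from the text. One character in the fourth row is an
# assumed repair (taken to be T), consistent with a longest run of 4 heads.
SEQUENCE = (
    "THTHTTTHTTTTTHTHTTTHTTHHHTHHTHTHTHTTTTHHTTHHTTHHHT"
    "HHHTTHHHTTTHHHTHHHHTTTHTHTHHHHTHTTTHHHTHHTHTTTHHTH"
    "HHTHHHHTTHTHHTHHHTTTHTHHHTHHTTTHHHTTTTHHHTHTHHHHTH"
    "TTHHTTTTHTHTHTTHTHHTTHTTTHTTTTHHHHTHTHHHTTHHHHTTHH"
)


def longest_run(tosses, symbol="H"):
    """Length of the longest unbroken run of `symbol` in a string of tosses."""
    best = current = 0
    for toss in tosses:
        current = current + 1 if toss == symbol else 0
        best = max(best, current)
    return best


def estimate_short_run_probability(n_tosses=200, max_run=4, trials=2000, seed=1):
    """Simulated chance that the longest run of heads is at most max_run."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A random number below .5 counts as H, otherwise T.
        tosses = "".join("H" if rng.random() < 0.5 else "T" for _ in range(n_tosses))
        hits += longest_run(tosses) <= max_run
    return hits / trials


expected_longest = log2(200) - 2 / 3  # approximation quoted in the text
print(longest_run(SEQUENCE), round(expected_longest, 2))
print(estimate_short_run_probability())
```

A simulated estimate of this kind only approximates the exact probability, but it is enough to see that a longest run of 4 or fewer heads in 200 fair tosses is unusual.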

As another example, consider giving a fair coin to each of the two authors of this 
textbook. Carlton tosses his coin repeatedly until obtaining the sequence HTT. 
Devore tosses his coin repeatedly until the sequence HTH is observed. Is Carlton’s 
expected number of tosses to obtain his desired sequence the same as Devore’s, or is 
one expected number of tosses smaller than the other? Most students to whom we 
have asked these questions initially answered that the two expected numbers should 
be the same. But this is not true. Some rather tricky probability arguments can be 
used to show that Carlton’s expected number of tosses is eight, whereas Devore 
expects to have to make ten tosses. Very surprising, no? A bit of intuition makes 
this more plausible. Suppose Carlton merrily tosses away until at some point he has 
just gotten HT. So he is very excited, thinking that just one more toss will enable 
him to stop tossing the coin and move on to some more interesting pursuit. 
Unfortunately his hopes are dashed because the next toss is an H. However, all is 
not lost, as even though he must continue tossing, at this point he is partway toward 
reaching his goal of HTT. If Devore sees HT at some point and gets excited by light 
at the end of the tunnel but then is crushed by the appearance of a T rather than an H, 
he essentially has to start over again from scratch. The charming nontechnical book 
Probabilities: The Little Numbers That Rule Our Lives by Peter Olofsson has more 
detail on this and other probability conundrums. 
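
For readers who would like to check the values eight and ten before seeing the formal argument, here is a small Python sketch. It combines a direct simulation with a known shortcut for fair coins: the expected waiting time for a pattern is the sum of 2^k over every length k at which the pattern's first k symbols match its last k symbols. The function names, trial count, and seed are our own choices.

```python
import random


def expected_tosses(pattern):
    """Exact mean waiting time for a fair coin, via the pattern's self-overlaps."""
    n = len(pattern)
    return sum(2**k for k in range(1, n + 1) if pattern[:k] == pattern[n - k:])


def simulate_mean_tosses(pattern, trials=20000, seed=0):
    """Average number of fair-coin tosses until `pattern` first appears."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        recent = ""  # the most recent len(pattern) tosses
        count = 0
        while not recent.endswith(pattern):
            recent = (recent + rng.choice("HT"))[-len(pattern):]
            count += 1
        total += count
    return total / trials


for pattern in ("HTT", "HTH"):
    print(pattern, expected_tosses(pattern), round(simulate_mean_tosses(pattern), 2))
```

The overlap rule reflects the intuition in the text: HTH overlaps with itself (it begins and ends with H), and that self-overlap adds the extra two tosses, while HTT has no such overlap.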

One of the all-time classic probability puzzles that stump most people is called 
the Birthday Problem. Consider a group of individuals, all of whom were born in 
the same year (one that did not have a February 29). If the group size is 400, how 
likely is it that at least two members of the group share the same birthday? 
Hopefully a moment’s reflection will bring you to the realization that a shared 
birthday here is a sure thing (100% chance), since there are only 365 possible 
birthdays for the 400 people. On the other hand, it is intuitively quite unlikely that 
there will be a shared birthday if the group size is only five; in this case we would 
expect that all five individuals would have different birthdays. 

Clearly as the group size increases, it becomes more likely that two or more 
individuals will have the same birthday. So how large does the group size have to be 
in order for it to be more likely than not that at least two people share a birthday 
(i.e., that the likelihood of a shared birthday is more than 50%)? Which one of the 
following four group-size categories do you believe contains the correct answer to 
this question? 


(1) At least 100 (2) At least 50 but less than 100 
(3) At least 25 but less than 50 (4) Fewer than 25 


When we have asked this of students in our classes, a substantial majority opted 
for the first two categories. Very surprisingly, the correct answer is category (4). 
In Chapter 1 we will show that with as few as 23 people in the group, it is a bit more 
likely than not that at least two group members will have the same birthday. 
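
The birthday calculation is easy to carry out directly. The sketch below assumes that all 365 birthdays are equally likely and that group members' birthdays are independent; the function name is our own.

```python
def prob_shared_birthday(group_size, days=365):
    """P(at least two people share a birthday), assuming equally likely days."""
    if group_size > days:
        return 1.0  # more people than days forces a shared birthday
    p_all_different = 1.0
    for i in range(group_size):
        p_all_different *= (days - i) / days
    return 1 - p_all_different


# Smallest group size for which a shared birthday is more likely than not.
smallest = next(n for n in range(1, 366) if prob_shared_birthday(n) > 0.5)
print(smallest, round(prob_shared_birthday(smallest), 4))
print(round(prob_shared_birthday(5), 4), prob_shared_birthday(400))
```

The calculation works with the complementary event "all birthdays are different," multiplying the chance that each successive person avoids every birthday already taken.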

Two people having the same birthday implies that they were born within 24 h of 
one another, but the converse is not true; e.g., one person might be born just before 
midnight on a particular day and another person just after midnight on the next day. 
This implies that it is more likely that two people will have been born within 24 h of 
one another than it is that they have the same birthday. It follows that a smaller 
group size than 23 is needed to make it more likely than not that at least two people 
will have been born within 24 h of one another. In Sect. 4.9 we show how this group 
size can be determined. 

Two people in a group having the same birthday is an example of a coincidence, 
an accidental and seemingly surprising occurrence of events. The fact that even for 
a relatively small group size it is more likely than not that this coincidence will 
occur should suggest that coincidences are often not as surprising as they might 
seem. This is because even if a particular coincidence (e.g., “graduated from the 
same high school” or “visited the same small town in Croatia”) is quite unlikely, 
there are so many opportunities for coincidences that quite a few are sure to occur. 

Back to the follies of misunderstanding medical information: Suppose the 
incidence rate of a particular disease in a certain population is 1 in 1000. The 
presence of the disease cannot be detected visually, but a diagnostic test is avail- 
able. The diagnostic test correctly detects 98% of all diseased individuals (this is the 
sensitivity of the test, its ability to detect the presence of the disease), and 93% of 
non-diseased individuals test negative for the disease (this is the specificity of the 
test, an indicator of how specific the test is to the disease under consideration). 
Suppose a single individual randomly selected from the population is given the test 
and the test result is positive. In light of this information, how likely is it that the 
individual will have the disease? 

First note that if the sensitivity and the specificity were both 100%, then it would 
be a sure thing that the selected individual has the disease. The reason this is not a 
sure thing is that the test sometimes makes mistakes. Which one of the following 
five categories contains the actual likelihood of having the disease under the 
described conditions? 

1. At least a 75% chance (quite likely) 
2. At least 50% but less than 75% (moderately likely) 
3. At least 25% but less than 50% (somewhat likely) 
4. At least 10% but less than 25% (rather unlikely) 
5. Less than 10% (quite unlikely) 

Student responses to this question have overwhelmingly been in categories (1) or 
(2)—another case of intuition going awry. The correct answer turns out to be 
category (5). In fact, even in light of the positive test result, there is still only a 
bit more than a 1% chance that the individual is diseased! 

What is the explanation for this counterintuitive result? Suppose we start with 
100,000 individuals from the population. Then we’d expect 100 of those to 
be diseased (from the 1 in 1000 incidence rate) and 99,900 to be disease free. From 
the 100 we expect to be diseased, we’d expect 98 positive test results (98% 
sensitivity). And from the 99,900 we expect to be disease free, we’d expect 7% 
of those or 6993 to yield positive test results. Thus we expect many more false 
positives than true positives. This is because the disease is quite rare and the 
diagnostic test is rather good but not stunningly so. (In case you think our sensitivity 
and specificity are low, consider a certain D-dimer test for the presence of a 
coronary embolism; its sensitivity and specificity are 88% and 75%, respectively.) 

Later in Chapter 1 (Example 1.31) we develop probability rules which can be 
used to show that the posterior probability of having the disease conditional on a 
positive test result is .0138—a bit over 1%. This should make you very cautious 
about interpreting the results of diagnostic tests. Before you panic in light of a 
positive test result, you need to know the incidence rate for the condition under 
consideration and both the sensitivity and specificity of the test. There are also 
implications for situations involving detection of something other than a disease. 
Consider airport procedures that are used to detect the presence of a terrorist. What 
do you think is the incidence rate of terrorists at a given airport, and how sensitive 
and specific do you think detection procedures are? The overwhelming number of 
positive test results will be false, greatly inconveniencing those who test positive! 
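
The arithmetic behind the .0138 figure can be reproduced in a few lines of Python. This is only a sketch of the counting argument given above (true positives divided by all positives); the function and parameter names are our own.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(diseased | positive test) from the incidence rate and test accuracy."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)


# Values from the text: 1 in 1000 incidence, 98% sensitivity, 93% specificity.
posterior = posterior_given_positive(0.001, 0.98, 0.93)
print(round(posterior, 4))

# The same answer from expected counts among 100,000 people.
print(round(98 / (98 + 6993), 4))
```

Rerunning the function with a much smaller incidence rate, as in the airport screening scenario, shows how quickly false positives come to dominate.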

Here’s one final example of probability applied in everyday life: One of the 
following columns contains the value of the closing stock index as of August 
8, 2012, for each of a number of countries, and the other column contains fake 
data obtained with a random number generator. Just by looking at the numbers, 
without considering context, can you tell which column is fake and which is real? 


China          2264      3058 
Japan          8881      9546 
Britain        5846      7140 
Canada       11,781      6519 
Euro area      7197       511 
Austria        2053      4995 
France         3438      2097 
Germany        6966      4628 
Italy        14,665      8461 
Spain           722       598 
Norway          480      1133 
Russia         1445      4100 
Sweden         1080      2594 
Turkey       64,699    35,027 
Hong Kong    20,066    42,182 
India        17,601      3388 
Pakistan     14,744    10,076 
Singapore      3052      2271 
Thailand       1214      7460 
Argentina      2459      2159 


The key to answering this question is a result called Benford’s Law. Suppose you 
start reading through a particular issue of a publication like the New York Times or 
The Economist, and each time you encounter any number (the amount of donations 
to a particular political candidate, the age of an actor, the number of members of a 
union, and so on), you record the first digit of that number. Possible first digits are 
1, 2, 3,..., or 9. In the long run, how frequently do you think each of these nine 
possible first digits will be encountered? Your first thought might be that each one 
should have the same long-run frequency, 1/9 (roughly 11%). But for many sets of 
numbers this turns out not to be the case. Instead the long-run frequency is given by 
the formula $\log_{10}[(x + 1)/x]$, which gives .301, .176, .125, ..., .051, .046 for x = 1, 
2, 3, ..., 8, 9. Thus a leading digit is much more likely to be 1, 2, or 3 than 7, 8, or 9. 

Examination of the foregoing lists of numbers shows that the first column 
conforms much more closely to Benford’s Law than does the second column. 
In fact, the first column is real, whereas the second one is fake. For Benford’s 
Law to be valid, it is generally required that the set of numbers under consideration 
span several orders of magnitude. It does not work, for example, with batting 
averages of major league baseball players, most of which are between .200 and 
.299, or with fuel efficiency ratings (miles per gallon) for automobiles, most of 
which are currently between 15 and 30. Benford’s Law has been employed to detect 
fraud in accounting reports, and in particular to detect fraudulent tax returns. 
So beware when you file your taxes next year! 
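
The Benford frequencies and the leading digits of the two columns can be tabulated with a short Python sketch. The two lists transcribe the table above; two entries in the second column (511 and 2271) are assumed readings of values that were not fully legible in our copy, and the function names are our own.

```python
from collections import Counter
from math import log10


def benford_probability(digit):
    """Long-run frequency of a leading digit under Benford's Law."""
    return log10((digit + 1) / digit)


def leading_digit_counts(values):
    """Count how often each first digit occurs among positive integers."""
    return Counter(int(str(value)[0]) for value in values)


COLUMN_1 = [2264, 8881, 5846, 11781, 7197, 2053, 3438, 6966, 14665, 722,
            480, 1445, 1080, 64699, 20066, 17601, 14744, 3052, 1214, 2459]
# 511 and 2271 below are assumed readings of partly illegible entries.
COLUMN_2 = [3058, 9546, 7140, 6519, 511, 4995, 2097, 4628, 8461, 598,
            1133, 4100, 2594, 35027, 42182, 3388, 10076, 2271, 7460, 2159]

counts_1 = leading_digit_counts(COLUMN_1)
counts_2 = leading_digit_counts(COLUMN_2)
for digit in range(1, 10):
    expected = 20 * benford_probability(digit)  # expected count among 20 values
    print(digit, round(expected, 1), counts_1[digit], counts_2[digit])
```

With only 20 values per column the comparison is informal, but the excess of leading 1s in the first column is already visible.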

This list of amusing probability appetizers could be continued for quite a while. 
Hopefully what we have shown thus far has sparked your interest in knowing more 
about the discipline. So without further ado... . 


Probability 


Probability is the subdiscipline of mathematics that focuses on a systematic study 
of randomness and uncertainty. In any situation in which one of a number of 
possible outcomes may occur, the theory of probability provides methods for 
quantifying the chances, or likelihoods, associated with the various outcomes. 
The language of probability is constantly used in an informal manner in both 
written and spoken contexts. Examples include such statements as “It is likely 
that the Dow Jones Industrial Average will increase by the end of the year,” “There 
is a 50-50 chance that the incumbent will seek reelection,” “There will probably be 
at least one section of that course offered next year,” “The odds favor a quick 
settlement of the strike,” and “It is expected that at least 20,000 concert tickets will 
be sold.” In this chapter, we introduce some elementary probability concepts, 
indicate how probabilities can be interpreted, and show how the rules of probability 
can be applied to compute the chances of many interesting events. The methodol- 
ogy of probability will then permit us to express in precise language such informal 
statements as those given above. 


1.1 Sample Spaces and Events 


In probability, an experiment refers to any action or activity whose outcome is 
subject to uncertainty. Although the word experiment generally suggests a planned 
or carefully controlled laboratory testing situation, we use it here in a much wider 
sense. Thus experiments that may be of interest include tossing a coin once or 
several times, selecting a card or cards from a deck, weighing a loaf of bread, 
measuring the commute time from home to work on a particular morning, deter- 
mining blood types from a group of individuals, or calling people to conduct a 
survey. 


M.A. Carlton and J.L. Devore, Probability with Applications in Engineering, Science, and 
Technology, Springer Texts in Statistics, DOI 10.1007/978-1-4939-0395-5_1, 
© Springer Science+Business Media New York 2014 


1.1.1 The Sample Space of an Experiment 


DEFINITION 
The sample space of an experiment, denoted by $\mathcal{S}$, is the set of all possible 
outcomes of that experiment. 


Example 1.1 The simplest experiment to which probability applies is one with two 
possible outcomes. One such experiment consists of examining a single fuse to see 
whether it is defective. The sample space for this experiment can be abbreviated as 
$\mathcal{S}$ = {N, D}, where N represents not defective, D represents defective, and the braces 
are used to enclose the elements of a set. Another such experiment would involve 
tossing a thumbtack and noting whether it landed point up or point down, with 
sample space $\mathcal{S}$ = {U, D}, and yet another would consist of observing the gender of 
the next child born at the local hospital, with $\mathcal{S}$ = {M, F}. ■ 


Example 1.2 If we examine three fuses in sequence and note the result of each 
examination, then an outcome for the entire experiment is any sequence of Ns and 
Ds of length 3, so 


$\mathcal{S}$ = {NNN, NND, NDN, NDD, DNN, DND, DDN, DDD} 


If we had tossed a thumbtack three times, the sample space would be obtained by 
replacing N by U in $\mathcal{S}$ above. A similar notational change would yield the sample 
space for the experiment in which the genders of three newborn children are 
observed. ■ 


Example 1.3 Two gas stations are located at a certain intersection. Each one has 
six gas pumps. Consider the experiment in which the number of pumps in use at a 
particular time of day is observed for each of the stations. An experimental outcome 
specifies how many pumps are in use at the first station and how many are in use at 
the second one. One possible outcome is (2, 2), another is (4, 1), and yet another is 
(1, 4). The 49 outcomes in $\mathcal{S}$ are displayed in the accompanying table. 


                            Second station 
First 
station     0       1       2       3       4       5       6 
   0      (0, 0)  (0, 1)  (0, 2)  (0, 3)  (0, 4)  (0, 5)  (0, 6) 
   1      (1, 0)  (1, 1)  (1, 2)  (1, 3)  (1, 4)  (1, 5)  (1, 6) 
   2      (2, 0)  (2, 1)  (2, 2)  (2, 3)  (2, 4)  (2, 5)  (2, 6) 
   3      (3, 0)  (3, 1)  (3, 2)  (3, 3)  (3, 4)  (3, 5)  (3, 6) 
   4      (4, 0)  (4, 1)  (4, 2)  (4, 3)  (4, 4)  (4, 5)  (4, 6) 
   5      (5, 0)  (5, 1)  (5, 2)  (5, 3)  (5, 4)  (5, 5)  (5, 6) 
   6      (6, 0)  (6, 1)  (6, 2)  (6, 3)  (6, 4)  (6, 5)  (6, 6) 


The sample space for the experiment in which a six-sided die is thrown twice results 
from deleting the 0 row and 0 column from the table, giving 36 outcomes. ■ 
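
Finite sample spaces like those in Examples 1.2 and 1.3 can be listed mechanically. The following Python sketch uses `itertools.product` to do so; the variable names are our own.

```python
from itertools import product

# Example 1.2: three fuses, each not defective (N) or defective (D).
three_fuses = ["".join(outcome) for outcome in product("ND", repeat=3)]

# Example 1.3: pumps in use (0 through 6) at each of two gas stations.
gas_stations = list(product(range(7), repeat=2))

# Two throws of a six-sided die: drop the 0 row and the 0 column.
two_dice = [(i, j) for (i, j) in gas_stations if i > 0 and j > 0]

print(three_fuses)
print(len(gas_stations), len(two_dice))
```

Each outcome is an ordered sequence, so the size of the sample space is the product of the number of possibilities at each position.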


Example 1.4 A reasonably large percentage of C++ programs written at a particu- 
lar company compile on the first run, but some do not. Suppose an experiment 
consists of selecting and compiling C++ programs at this location until encounter- 
ing a program that compiles on the first run. Denote a program that compiles on the 
first run by S (for success) and one that doesn’t do so by F (for failure). Although it 
may not be very likely, a possible outcome of this experiment is that the first 
5 (or 10 or 20 or .. .) are F's and the next one is an S. That is, for any positive integer 
n we may have to examine n programs before seeing the first S. The sample space is 
$\mathcal{S}$ = {S, FS, FFS, FFFS, ...}, which contains an infinite number of possible 
outcomes. The same abbreviated form of the sample space is appropriate for an 
experiment in which, starting at a specified time, the gender of each newborn infant 
is recorded until the birth of a female is observed. ■ 


1.1.2 Events 


In our study of probability, we will be interested not only in the individual outcomes 
of $\mathcal{S}$ but also in any collection of outcomes from $\mathcal{S}$. 


DEFINITION 

An event is any collection (subset) of outcomes contained in the sample space 
$\mathcal{S}$. An event is said to be simple if it consists of exactly one outcome and 
compound if it consists of more than one outcome. 


When an experiment is performed, a particular event A is said to occur if the 
resulting experimental outcome is contained in A. In general, exactly one simple 
event will occur, but many compound events will occur simultaneously. 


Example 1.5 Consider an experiment in which each of three vehicles taking a 
particular freeway exit turns left (L) or right (R) at the end of the off-ramp. The 
eight possible outcomes that comprise the sample space are LLL, RLL, LRL, LLR, 
LRR, RLR, RRL, and RRR. Thus there are eight simple events, among which are 
$E_1$ = {LLL} and $E_5$ = {LRR}. Some compound events include 


A = {RLL, LRL, LLR} = the event that exactly one of the three vehicles turns right 
B = {LLL, RLL, LRL, LLR} = the event that at most one of the vehicles turns right 
C = {LLL, RRR} = the event that all three vehicles turn in the same direction 


Suppose that when the experiment is performed, the outcome is LLL. Then the 
simple event $E_1$ has occurred and so also have the events B and C (but not A). 


Example 1.6 (Example 1.3 continued) When the number of pumps in use at each 
of two six-pump gas stations is observed, there are 49 possible outcomes, so there 
are 49 simple events: $E_1$ = {(0, 0)}, $E_2$ = {(0, 1)}, ..., $E_{49}$ = {(6, 6)}. Examples of 
compound events are 


A = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)} = the event that the number of 
pumps in use is the same for both stations 

B = {(0, 4), (1, 3), (2, 2), (3, 1), (4, 0)} = the event that the total number of pumps in 
use is four 

C = {(0, 0), (0, 1), (1, 0), (1, 1)} = the event that at most one pump is in use at each 
station ■ 
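
Because an event is just a subset of the sample space, the events of Example 1.6 can be built directly from their verbal descriptions. The Python sketch below does this with set comprehensions; the helper `occurs` is a hypothetical name of our own.

```python
from itertools import product

# Sample space of Example 1.3: (pumps in use at station 1, at station 2).
sample_space = set(product(range(7), repeat=2))

A = {(i, j) for (i, j) in sample_space if i == j}             # same number in use
B = {(i, j) for (i, j) in sample_space if i + j == 4}          # total in use is four
C = {(i, j) for (i, j) in sample_space if i <= 1 and j <= 1}   # at most one at each


def occurs(event, outcome):
    """An event occurs when the observed outcome is one of its members."""
    return outcome in event


outcome = (2, 2)
print(occurs(A, outcome), occurs(B, outcome), occurs(C, outcome))
```

A single observed outcome such as (2, 2) belongs to several compound events at once, which is the sense in which many compound events can occur simultaneously.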


Example 1.7 (Example 1.4 continued) The sample space for the program compi- 
lation experiment contains an infinite number of outcomes, so there are an infinite 
number of simple events. Compound events include 


A = {S, FS, FFS} = the event that at most three programs are examined 

B = {S, FFS, FFFFS} = the event that exactly one, three, or five programs 
are examined 

C = {FS, FFFS, FFFFFS, ...} = the event that an even number of programs are 
examined ■ 


1.1.3 Some Relations from Set Theory 


An event is nothing but a set, so relationships and results from elementary set theory 
can be used to study events. The following operations will be used to construct new 
events from given events. 


DEFINITION 


1. The complement of an event A, denoted by $A'$, is the set of all outcomes in 
$\mathcal{S}$ that are not contained in A. 

2. The intersection of two events A and B, denoted by $A \cap B$ and read “A and 
B,” is the event consisting of all outcomes that are in both A and B. 

3. The union of two events A and B, denoted by $A \cup B$ and read “A or B,” is 
the event consisting of all outcomes that are either in A or in B or in both 
events (so that the union includes outcomes for which both A and B occur 
as well as outcomes for which exactly one occurs)—that is, all outcomes in 
at least one of the events. 

Example 1.8 (Example 1.3 continued) For the experiment in which the number of
pumps in use at a single six-pump gas station is observed, let A = {0, 1, 2, 3, 4},
B = {3, 4, 5, 6}, and C = {1, 3, 5}. Then

$A \cup B = \{0, 1, 2, 3, 4, 5, 6\} = S$, $A \cup C = \{0, 1, 2, 3, 4, 5\}$

$A \cap B = \{3, 4\}$, $A \cap C = \{1, 3\}$, $A' = \{5, 6\}$, $(A \cup C)' = \{6\}$ ■

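The set operations of Example 1.8 can be reproduced with Python's built-in `set` type. This is only an illustrative sketch; the helper name `complement` is our own and is not part of the text.

```python
# Sample space and events from Example 1.8 (number of pumps in use).
S = set(range(7))            # {0, 1, ..., 6}
A = {0, 1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {1, 3, 5}

def complement(event, sample_space=S):
    """Return the outcomes of the sample space that are not in the event."""
    return sample_space - event

union_AB = A | B                    # A or B
union_AC = A | C                    # A or C
inter_AB = A & B                    # A and B
inter_AC = A & C                    # A and C
comp_A = complement(A)              # not A
comp_union_AC = complement(A | C)   # not (A or C)

print(union_AB, union_AC, inter_AB, inter_AC, comp_A, comp_union_AC)
```

The operators `|`, `&`, and set difference play the roles of union, intersection, and complement relative to S.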
Example 1.9 (Example 1.4 continued) In the program compilation experiment, 
define A, B, and C as in Example 1.7. Then 

$A \cup B$ = {S, FS, FFS, FFFFS}

$A \cap B$ = {S, FFS}

$A'$ = {FFFS, FFFFS, FFFFFS, ...}

and

$C'$ = {S, FFS, FFFFS, ...} = the event that an odd number of programs are
examined ■


The complement, intersection, and union operators from set theory correspond to 
the not, and, and or operators from computer science. Readers with prior program- 
ming experience may be aware of an important relationship between these three 
operators, first discovered by the nineteenth-century British mathematician
Augustus De Morgan.


DE MORGAN'S LAWS 
Let A and B be two events in the sample space of some experiment. Then 
1. $(A \cup B)' = A' \cap B'$
2. $(A \cap B)' = A' \cup B'$


De Morgan’s laws state that the complement of a union is an intersection of 
complements, and the complement of an intersection is a union of complements. 
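For a small finite sample space, De Morgan's laws can be checked exhaustively by machine. The sketch below assumes a hypothetical four-outcome sample space and tests every pair of events; it illustrates the laws but is not a proof for general sample spaces.

```python
from itertools import combinations

# Hypothetical sample space with four outcomes (an assumption for illustration).
S = frozenset(range(4))

def all_events(sample_space):
    """List every subset (event) of a finite sample space."""
    items = sorted(sample_space)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def de_morgan_holds(sample_space):
    """Check both of De Morgan's laws for every pair of events."""
    events = all_events(sample_space)
    for A in events:
        for B in events:
            # (A or B)' == A' and B'
            if sample_space - (A | B) != (sample_space - A) & (sample_space - B):
                return False
            # (A and B)' == A' or B'
            if sample_space - (A & B) != (sample_space - A) | (sample_space - B):
                return False
    return True

print(de_morgan_holds(S))
```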

Sometimes A and B have no outcomes in common, so that the intersection of 
A and B contains no outcomes (see Exercise 11). 


DEFINITION 

When A and B have no outcomes in common, they are said to be disjoint or 
mutually exclusive events. Mathematicians write this compactly as 
$A \cap B = \varnothing$, where $\varnothing$ denotes the event consisting of no outcomes whatsoever
(the “null” or “empty” event). 

Example 1.10 A small city has three automobile dealerships: a GM dealer selling
Chevrolets and Buicks; a Ford dealer selling Fords and Lincolns; and a Chrysler 
dealer selling Jeeps and Chryslers. If an experiment consists of observing the brand 
of the next car sold, then the events A={Chevrolet, Buick} and B= {Ford, 
Lincoln} are mutually exclusive because the next car sold cannot be both a GM 
product and a Ford product. ■


A pictorial representation of events and manipulations with events is obtained by 
using Venn diagrams. To construct a Venn diagram, draw a rectangle whose 
interior will represent the sample space S. Then any event A is represented as the 
interior of a closed curve (often a circle) contained in S. Figure 1.1 shows examples
of Venn diagrams. 


Fig. 1.1 Venn diagrams. (a) Venn diagram of events A and B (b) Shaded region is $A \cap B$ (c)
Shaded region is $A \cup B$ (d) Shaded region is $A'$ (e) Mutually exclusive events


The operations of union and intersection can be extended to more than two 
events. For any three events A, B, and C, the event $A \cap B \cap C$ is the set of outcomes
contained in all three events, whereas $A \cup B \cup C$ is the set of outcomes contained in
at least one of the three events. A collection of several events is said to be mutually 
exclusive (or pairwise disjoint) if no two events have any outcomes in common. 


1.1.4 Exercises: Section 1.1 (1-12) 


1. Ann and Bev have each applied for several jobs at a local university. Let A be 
the event that Ann is hired and let B be the event that Bev is hired. Express in 
terms of A and B the events 
(a) Ann is hired but not Bev. 

(b) At least one of them is hired. 
(c) Exactly one of them is hired. 

2. Two voters, Al and Bill, are each choosing between one of three candidates—1, 
2, and 3—who are running for city council. An experimental outcome specifies 
both Al’s choice and Bill’s choice, e.g., the pair (3,2). 

(a) List all elements of S. 
(b) List all outcomes in the event A that Al and Bill make the same choice. 
(c) List all outcomes in the event B that neither of them votes for candidate 2. 

3. Four universities—1, 2, 3, and 4—are participating in a holiday basketball 
tournament. In the first round, 1 will play 2 and 3 will play 4. Then the two 
winners will play for the championship, and the two losers will also play. One 
possible outcome can be denoted by 1324: 1 beats 2 and 3 beats 4 in first-round
games, and then 1 beats 3 and 2 beats 4.
(a) List all outcomes in S.
(b) Let A denote the event that 1 wins the tournament. List outcomes in A.
(c) Let B denote the event that 2 gets into the championship game. List
outcomes in B.
(d) What are the outcomes in $A \cup B$ and in $A \cap B$? What are the outcomes in $A'$?


4. Suppose that vehicles taking a particular freeway exit can turn right (R), turn
left (L), or go straight (S). Consider observing the direction for each of three
successive vehicles.
(a) List all outcomes in the event A that all three vehicles go in the same
direction.
(b) List all outcomes in the event B that all three vehicles take different
directions.
(c) List all outcomes in the event C that exactly two of the three vehicles turn
right.
(d) List all outcomes in the event D that exactly two vehicles go in the same
direction.
(e) List the outcomes in $D'$, $C \cup D$, and $C \cap D$.


5. Three components are connected to form a system as shown in the
accompanying diagram. Because the components in the 2-3 subsystem are
connected in parallel, that subsystem will function if at least one of the two
individual components functions. For the entire system to function, component
1 must function and so must the 2-3 subsystem.

[Diagram: component 1 connected in series with the parallel pair of components 2 and 3]

The experiment consists of determining the condition of each component:
S (success) for a functioning component and F (failure) for a nonfunctioning
component.
(a) What outcomes are contained in the event A that exactly two out of the
three components function?
(b) What outcomes are contained in the event B that at least two of the
components function?
(c) What outcomes are contained in the event C that the system functions?
(d) List outcomes in $C'$, $A \cup C$, $A \cap C$, $B \cup C$, and $B \cap C$.


6. Each of a sample of four home mortgages is classified as fixed rate (F) or
variable rate (V).
(a) What are the 16 outcomes in S?
(b) Which outcomes are in the event that exactly three of the selected
mortgages are fixed rate?
(c) Which outcomes are in the event that all four mortgages are of the same
type?
(d) Which outcomes are in the event that at most one of the four is a
variable-rate mortgage?
(e) What is the union of the events in parts (c) and (d), and what is the
intersection of these two events?
(f) What are the union and intersection of the two events in parts (b) and (c)?


7. A family consisting of three persons (A, B, and C) belongs to a medical clinic
that always has a doctor at each of stations 1, 2, and 3. During a certain week,
each member of the family visits the clinic once and is assigned at random to a
station. The experiment consists of recording the station number for each
member. One outcome is (1, 2, 1) for A to station 1, B to station 2, and C to
station 1.
(a) List the 27 outcomes in the sample space.
(b) List all outcomes in the event that all three members go to the same station.
(c) List all outcomes in the event that all members go to different stations.
(d) List all outcomes in the event that no one goes to station 2.


8. A college library has five copies of a certain text on reserve. Two copies (1 and
2) are first printings, and the other three (3, 4, and 5) are second printings. A
student examines these books in random order, stopping only when a second
printing has been selected. One possible outcome is 5, and another is 213.
(a) List the outcomes in S.
(b) Let A denote the event that exactly one book must be examined. What
outcomes are in A?
(c) Let B be the event that book 5 is the one selected. What outcomes are in B?
(d) Let C be the event that book 1 is not examined. What outcomes are in C?


9. An academic department has just completed voting by secret ballot for a
department head. The ballot box contains four slips with votes for candidate
A and three slips with votes for candidate B. Suppose these slips are removed
from the box one by one.
(a) List all possible outcomes.
(b) Suppose a running tally is kept as slips are removed. For what outcomes
does A remain ahead of B throughout the tally?

10. A construction firm is currently working on three different buildings. Let $A_i$
denote the event that the ith building is completed by the contract date. Use the
operations of union, intersection, and complementation to describe each of the
following events in terms of $A_1$, $A_2$, and $A_3$, draw a Venn diagram, and shade
the region corresponding to each one.
(a) At least one building is completed by the contract date.
(b) All buildings are completed by the contract date.
(c) Only the first building is completed by the contract date.
(d) Exactly one building is completed by the contract date.
(e) Either the first building or both of the other two buildings are completed by
the contract date.

11. Use Venn diagrams to verify De Morgan’s laws:
(a) $(A \cup B)' = A' \cap B'$
(b) $(A \cap B)' = A' \cup B'$

12. (a) In Example 1.10, identify three events that are mutually exclusive. 

(b) Suppose there is no outcome common to all three of the events A, B, and C. 
Are these three events necessarily mutually exclusive? If your answer is 
yes, explain why; if your answer is no, give a counterexample using the 
experiment of Example 1.10. 


1.2 Axioms, Interpretations, and Properties of Probability 


Given an experiment and its sample space S, the objective of probability is to assign
to each event A a number P(A), called the probability of the event A, which will 
give a precise measure of the chance that A will occur. To ensure that the probabil- 
ity assignments will be consistent with our intuitive notions of probability, all 
assignments should satisfy the following axioms (basic properties) of probability. 


AXIOM 1 
For any event A, $P(A) \ge 0$.


AXIOM 2 
$P(S) = 1$.


AXIOM 3 
If $A_1, A_2, A_3, \ldots$ is an infinite collection of disjoint events, then

$$P(A_1 \cup A_2 \cup A_3 \cup \cdots) = \sum_{i=1}^{\infty} P(A_i)$$


Axiom 1 reflects the intuitive notion that the chance of A occurring should be
nonnegative. The sample space is by definition the event that must occur when the
experiment is performed (S contains all possible outcomes), so Axiom 2 says that
the maximum possible probability of 1 is assigned to S. The third axiom formalizes
the idea that if we wish the probability that at least one of a number of events will 
occur and no two of the events can occur simultaneously, then the chance of at least 
one occurring is the sum of the chances of the individual events. 




You might wonder why the third axiom contains no reference to a finite collection 
of disjoint events. It is because the corresponding property for a finite collection can 
be derived from our three axioms. We want our axiom list to be as short as possible 
and not contain any property that can be derived from others on the list. 


PROPOSITION 
$P(\varnothing) = 0$, where $\varnothing$ is the null event. This, in turn, implies that the property
contained in Axiom 3 is valid for a finite collection of events. 


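Proof First consider the infinite collection $A_1 = \varnothing$, $A_2 = \varnothing$, $A_3 = \varnothing$, .... Since
$\varnothing \cap \varnothing = \varnothing$, the events in this collection are disjoint and $\bigcup A_i = \varnothing$. Axiom 3 then
gives

$$P(\varnothing) = \sum_{i=1}^{\infty} P(\varnothing)$$

This can happen only if $P(\varnothing) = 0$.

Now suppose that $A_1, A_2, \ldots, A_k$ are disjoint events, and append to these the
infinite collection $A_{k+1} = \varnothing$, $A_{k+2} = \varnothing$, $A_{k+3} = \varnothing$, .... Then the events
$A_1, A_2, \ldots, A_k, A_{k+1}, \ldots$ are disjoint, since $A \cap \varnothing = \varnothing$ for all events. Again invoking Axiom 3,

$$P\left(\bigcup_{i=1}^{k} A_i\right) = P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{k} P(A_i) + \sum_{i=k+1}^{\infty} P(A_i) = \sum_{i=1}^{k} P(A_i) + \sum_{i=k+1}^{\infty} 0 = \sum_{i=1}^{k} P(A_i)$$

as desired. ■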


Example 1.11 Consider tossing a thumbtack in the air. When it comes to rest on 
the ground, either its point will be up (the outcome U) or down (the outcome D). 
The sample space for this experiment is therefore S = {U, D}. The axioms specify P(S) =
1, so the probability assignment will be completed by determining P(U) and P(D).
Since U and D are disjoint and their union is S, the foregoing proposition implies
that

$$1 = P(S) = P(U) + P(D)$$

It follows that P(D) = 1 - P(U). One possible assignment of probabilities is
P(U) = .5, P(D) = .5, whereas another possible assignment is P(U) = .75,
P(D) = .25. In fact, letting p represent any fixed number between 0 and
1, P(U) = p, P(D) = 1 - p is an assignment consistent with the axioms. ■

Example 1.12 Consider testing batteries coming off an assembly line one by one
until a battery having a voltage within prescribed limits is found. The simple events
are $E_1$ = {S}, $E_2$ = {FS}, $E_3$ = {FFS}, $E_4$ = {FFFS}, .... Suppose the probability of
any particular battery being satisfactory is .99. Then it can be shown that the
probability assignment $P(E_1) = .99$, $P(E_2) = (.01)(.99)$, $P(E_3) = (.01)^2(.99)$, ...
satisfies the axioms. In particular, because the $E_i$s are disjoint and
$S = E_1 \cup E_2 \cup E_3 \cup \cdots$, Axioms 2 and 3 require that

$$1 = P(S) = P(E_1) + P(E_2) + P(E_3) + \cdots = .99\left[1 + .01 + (.01)^2 + (.01)^3 + \cdots\right]$$

This can be verified using the formula for the sum of a geometric series:

$$a + ar + ar^2 + ar^3 + \cdots = \frac{a}{1 - r}$$

Here a = .99 and r = .01, so the sum is .99/(1 - .01) = 1.

However, another legitimate (according to the axioms) probability assignment
of the same “geometric” type is obtained by replacing .99 by any other number
p between 0 and 1 (and .01 by 1 - p). ■
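A quick numerical check of the geometric assignment in Example 1.12: the partial sums of $P(E_1) + P(E_2) + \cdots$ should approach 1 for any p strictly between 0 and 1. The function name below is our own, and floating-point arithmetic means the result is only approximately 1.

```python
def geometric_partial_sum(p, n_terms):
    """Sum P(E_1) + ... + P(E_n), where P(E_i) = (1 - p)**(i - 1) * p."""
    return sum((1 - p) ** (i - 1) * p for i in range(1, n_terms + 1))

# p = .99 as in the battery example; the sum is already close to 1 after a few terms.
partial_5 = geometric_partial_sum(0.99, 5)
partial_50 = geometric_partial_sum(0.99, 50)

# A smaller p converges more slowly but still approaches 1.
partial_other = geometric_partial_sum(0.3, 200)

print(partial_5, partial_50, partial_other)
```

The partial sum of the first n terms equals $1 - (1 - p)^n$, which is why it gets arbitrarily close to 1 as n grows.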


1.2.1 Interpreting Probability 


Examples 1.11 and 1.12 show that the axioms do not completely determine an 
assignment of probabilities to events. The axioms serve only to rule out 
assignments inconsistent with our intuitive notions of probability. In the tack- 
tossing experiment of Example 1.11, two particular assignments were suggested. 
The appropriate or correct assignment depends on the nature of the thumbtack and 
also on one’s interpretation of probability. The interpretation that is most often used 
and most easily understood is based on the notion of relative frequencies. 

Consider an experiment that can be repeatedly performed in an identical and 
independent fashion, and let A be an event consisting of a fixed set of outcomes of 
the experiment. Simple examples of such repeatable experiments include the tack- 
tossing and die-tossing experiments previously discussed. If the experiment is 
performed n times, on some of the replications the event A will occur (the outcome 
will be in the set A), and on others, A will not occur. Let n(A) denote the number of 
replications on which A does occur. Then the ratio n(A)/n is called the relative 
frequency of occurrence of the event A in the sequence of n replications. 

For example, let A be the event that a package sent within the state of California 
for 2-day delivery actually arrives within 1 day. The results from sending ten such 
packages (the first ten replications) are as follows. 


Package #                  1     2     3     4     5     6     7     8     9     10
Did A occur?               N     Y     Y     Y     N     N     Y     Y     N     N
Relative frequency of A    0     .5    .667  .75   .6    .5    .571  .625  .556  .5
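The relative frequencies in the table can be recomputed directly from the Y/N sequence. The sketch below uses a hypothetical helper, `running_relative_frequency`, that tracks n(A)/n after each replication.

```python
# Results for the ten packages, in order (Y = event A occurred).
outcomes = "NYYYNNYYNN"

def running_relative_frequency(results, hit="Y"):
    """Return n(A)/n after each of the first n replications."""
    freqs = []
    count = 0
    for n, result in enumerate(results, start=1):
        if result == hit:
            count += 1
        freqs.append(count / n)
    return freqs

freqs = running_relative_frequency(outcomes)
rounded = [round(f, 3) for f in freqs]
print(rounded)
```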

Fig. 1.2 Behavior of relative frequency: (a) Initial fluctuation (b) Long-run stabilization


Figure 1.2a shows how the relative frequency n(A)/n fluctuates rather substan- 
tially over the course of the first 50 replications. But as the number of replications 
continues to increase, Fig. 1.2b illustrates how the relative frequency stabilizes. 

More generally, empirical evidence, based on the results of many such repeat- 
able experiments, indicates that any relative frequency of this sort will stabilize as 
the number of replications n increases. That is, as n gets arbitrarily large, n(A)/n 
approaches a limiting value we refer to as the limiting (or long-run) relative
frequency of the event A. The objective interpretation of probability identifies 
this limiting relative frequency with P(A). A formal justification of this interpreta- 
tion is provided by the Law of Large Numbers, a theorem we’ll encounter in 
Chap. 4. 
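The stabilization of relative frequency can be illustrated by simulation. The sketch below assumes, purely for illustration, that each replication results in A with probability .6 independently of the others; the function name and the seed are our own choices.

```python
import random

def simulate_relative_frequency(p, n, seed=0):
    """Simulate n independent replications and return n(A)/n.

    Each replication results in the event A with assumed probability p.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(1 for _ in range(n) if rng.random() < p)
    return hits / n

# Relative frequency after increasingly many replications.
for n in (10, 100, 1_000, 100_000):
    print(n, simulate_relative_frequency(0.6, n))
```

For small n the printed values can be far from .6, while for large n they should settle near it, which mirrors the behavior described for Fig. 1.2.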

Suppose that probabilities are assigned to events in accordance with their 
limiting relative frequencies. Then a statement such as “the probability of a package 
being delivered within 1 day of mailing is .6” means that of a large number of 
mailed packages, roughly 60% will arrive within 1 day. Similarly, if B is the event 
that a certain brand of dishwasher will need service while under warranty, then 
P(B) =.1 is interpreted to mean that in the long run 10% of such dishwashers will 
need warranty service. This does not mean that exactly 1 out of 10 will need service, 
or exactly 20 out of 200 will need service, because 10 and 200 are not the long run. 
Such misinterpretations of probability as a guarantee on short-term outcomes are at
the heart of the infamous gambler’s fallacy.

This relative frequency interpretation of probability is said to be objective 
because it rests on a property of the experiment rather than on any particular 
individual concerned with the experiment. For example, two different observers 
of a sequence of coin tosses should both use the same probability assignments since 
the observers have nothing to do with limiting relative frequency. 

In practice, this interpretation is not as objective as it might seem, because the 
limiting relative frequency of an event will not be known. Thus we will have to 
assign probabilities based on our beliefs about the limiting relative frequency of
events under study. Fortunately, there are many experiments for which there will be 
a consensus with respect to probability assignments. When we speak of a fair coin, 
we shall mean P(H) = P(T) =.5, and a fair die is one for which limiting relative 
frequencies of the six outcomes are all equal, suggesting probability assignments 
$P(1) = \cdots = P(6) = 1/6$.

Because the objective interpretation of probability is based on the notion of 
limiting frequency, its applicability is limited to experimental situations that are 
repeatable. Yet the language of probability is often used in connection with 
situations that are inherently unrepeatable. Examples include: “The chances are 
good for a peace agreement”; “It is likely that our company will be awarded the 
contract”; and “Because their best quarterback is injured, I expect them to score no 
more than 10 points against us.” In such situations we would like, as before, to 
assign numerical probabilities to various outcomes and events (e.g., the probability 
is .9 that we will get the contract). We must therefore adopt an alternative interpre- 
tation of these probabilities. Because different observers may have different prior 
information and opinions concerning such experimental situations, probability 
assignments may now differ from individual to individual. Interpretations in such 
situations are thus referred to as subjective. The book by Winkler listed in the 
references gives a very readable survey of several subjective interpretations. 
Importantly, even subjective interpretations of probability must satisfy the three 
axioms (and all properties that follow from the axioms) in order to be valid. 


1.2.2. More Probability Properties 


COMPLEMENT RULE 
For any event A, P(A) = 1 — P(A’). 


Proof Since by definition of $A'$, $A \cup A' = S$ while A and $A'$ are disjoint, $1 = P(S) =
P(A \cup A') = P(A) + P(A')$, from which the desired result follows. ■


This proposition is surprisingly useful because there are many situations in which 
P(A’) is more easily obtained by direct methods than is P(A). 


Example 1.13 Consider a system of five identical components connected in series, 
as illustrated in Fig. 1.3. 

Denote a component that fails by F and one that doesn’t fail by S (for success). 
Let A be the event that the system fails. For A to occur, at least one of the individual 
components must fail. Outcomes in A include SSFSS (1, 2, 4, and 5 all work, but 
3 does not), FFSSS, and so on. There are, in fact, 31 different outcomes in A! 
However, A’, the event that the system works, consists of the single outcome SSSSS. 
We will see in Sect. 1.5 that if 90% of all these components do not fail and different
components fail independently of one another, then $P(A') = .9^5 \approx .59$. Thus P(A) =
1 - .59 = .41; so among a large number of such systems, roughly 41% will fail. ■

Fig. 1.3 A system of five components connected in series
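The saving offered by the Complement Rule in Example 1.13 can be seen by computing P(A) two ways: once through the single outcome in $A'$, and once by summing over every outcome in A. The sketch assumes, as the example does, that components work independently with probability .9; the function names are our own.

```python
from itertools import product

def p_system_fails(p_work, n):
    """Complement Rule: P(A) = 1 - P(A'), where A' = all n components work."""
    return 1 - p_work ** n

def p_system_fails_by_enumeration(p_work, n):
    """Direct method: add the probabilities of all outcomes with at least one F."""
    total = 0.0
    for outcome in product("SF", repeat=n):
        if "F" in outcome:
            prob = 1.0
            for component in outcome:
                prob *= p_work if component == "S" else 1 - p_work
            total += prob
    return total

# Outcomes of the five-component system in which at least one component fails.
failing_outcomes = [o for o in product("SF", repeat=5) if "F" in o]

print(len(failing_outcomes))
print(p_system_fails(0.9, 5), p_system_fails_by_enumeration(0.9, 5))
```

Both routes give the same probability, but the complement route needs one term rather than one per outcome in A.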


In general, the Complement Rule is useful when the event of interest can be 
expressed as “at least ...,” because the complement “less than . . .” may be easier to 
work with. (In some problems, “more than ...” is easier to deal with than “at most
...”) When you are having difficulty calculating P(A) directly, think of determining
$P(A')$.


PROPOSITION 
For any event A, $P(A) \le 1$.


This follows from the previous proposition: $1 = P(A) + P(A') \ge P(A)$, because
$P(A') \ge 0$ by Axiom 1.

When A and B are disjoint, we know that $P(A \cup B) = P(A) + P(B)$. How can this
union probability be obtained when the events are not disjoint? 


ADDITION RULE 
For any events A and B, 


$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$


Notice that the proposition is valid even if A and B are disjoint, since then 
$P(A \cap B) = 0$. The key idea is that, in adding P(A) and P(B), the probability of the
intersection $A \cap B$ is actually counted twice, so $P(A \cap B)$ must be subtracted out.


Proof Note first that $A \cup B = A \cup (B \cap A')$, as illustrated in Fig. 1.4. Because A and
$(B \cap A')$ are disjoint, $P(A \cup B) = P(A) + P(B \cap A')$. But $B = (B \cap A) \cup (B \cap A')$ (the
union of that part of B in A and that part of B not in A). Furthermore, $(B \cap A)$ and
$(B \cap A')$ are disjoint, so that $P(B) = P(B \cap A) + P(B \cap A')$. Combining these results
gives

$$P(A \cup B) = P(A) + P(B \cap A') = P(A) + [P(B) - P(A \cap B)] = P(A) + P(B) - P(A \cap B)$$ ■

Fig. 1.4 Representing $A \cup B$ as a union of disjoint events



Example 1.14 In a certain residential suburb, 60% of all households get internet 
service from the local cable company, 80% get television service from that com- 
pany, and 50% get both services from the company. If a household is randomly 
selected, what is the probability that it gets at least one of these two services from 
the company, and what is the probability that it gets exactly one of the services from 
the company? 

With A = {gets internet service from the cable company} and B = {gets televi- 
sion service from the cable company }, the given information implies that P(A) = .6, 
P(B) = .8, and $P(A \cap B) = .5$. The Addition Rule then applies to give

P(gets at least one of these two services from the company) =
$P(A \cup B) = P(A) + P(B) - P(A \cap B) = .6 + .8 - .5 = .9$


The event that a household gets only television service from the company can be
written as $A' \cap B$, i.e., (not internet) and television. Now Fig. 1.4 implies that

$$.9 = P(A \cup B) = P(A) + P(A' \cap B) = .6 + P(A' \cap B)$$

from which $P(A' \cap B) = .3$. Similarly, $P(A \cap B') = P(A \cup B) - P(B) = .1$. This is all
illustrated in Fig. 1.5, from which we see that

$$P(\text{exactly one}) = P(A \cap B') + P(A' \cap B) = .1 + .3 = .4$$


Fig. 1.5 Probabilities for Example 1.14 ■
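The arithmetic of Example 1.14 can be replayed with exact fractions, which avoids floating-point rounding. The variable names are our own; the three input probabilities are the ones given in the example.

```python
from fractions import Fraction

# Given in Example 1.14.
p_a = Fraction(6, 10)    # internet service from the cable company
p_b = Fraction(8, 10)    # television service from the cable company
p_ab = Fraction(5, 10)   # both services

# Addition Rule: P(A or B) = P(A) + P(B) - P(A and B).
p_union = p_a + p_b - p_ab

# The union splits into disjoint pieces, so each "only" piece is a difference.
p_only_b = p_union - p_a     # P(A' and B)
p_only_a = p_union - p_b     # P(A and B')
p_exactly_one = p_only_a + p_only_b

print(float(p_union), float(p_only_b), float(p_only_a), float(p_exactly_one))
```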


The probability of a union of more than two events can be computed analo- 
gously. For three events A, B, and C, the result is 


$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$$


This can be seen by examining a Venn diagram of $A \cup B \cup C$, which is shown in
Fig. 1.6. When P(A), P(B), and P(C) are added, outcomes in certain intersections
are double counted and the corresponding probabilities must be subtracted. But this
results in $P(A \cap B \cap C)$ being subtracted once too often, so it must be added back.
One formal proof involves applying the Addition Rule to $P((A \cup B) \cup C)$, the
probability of the union of the two events $A \cup B$ and C; see Exercise 30. More
generally, a result concerning $P(A_1 \cup \cdots \cup A_k)$ can be proved by induction or by
other methods. The pattern of additions and subtractions (or, equivalently, the
method of deriving such union probability formulas) is often called the
inclusion-exclusion principle.

Fig. 1.6 $A \cup B \cup C$
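The three-event formula can be checked numerically on a small example. The sample space, outcome probabilities, and events below are hypothetical choices made only for illustration; an event's probability is taken to be the sum of the probabilities of its outcomes.

```python
from fractions import Fraction

# Hypothetical probabilities for outcomes 1..8 (weights k/36, which sum to 1).
prob = {k: Fraction(k, 36) for k in range(1, 9)}

def P(event):
    """Probability of an event = sum of the probabilities of its outcomes."""
    return sum((prob[outcome] for outcome in event), Fraction(0))

# Three overlapping events, chosen arbitrarily.
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {1, 4, 6, 8}

# Left side: probability of the union computed directly.
lhs = P(A | B | C)

# Right side: add singles, subtract pairwise overlaps, add back the triple overlap.
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))

print(lhs, rhs)
```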


1.2.3. Determining Probabilities Systematically 


When the number of possible outcomes (simple events) is large, there will be many 
compound events. A simple way to determine probabilities for these events that 
avoids violating the axioms and derived properties is to first determine probabilities 
$P(E_i)$ for all simple events. These should satisfy $P(E_i) \ge 0$ and $\sum_{\text{all } i} P(E_i) = 1$.
Then the probability of any compound event A is computed by adding together the
$P(E_i)$s for all $E_i$s in A:

$$P(A) = \sum_{\text{all } E_i \text{ in } A} P(E_i)$$


Example 1.15 During off-peak hours a commuter train has five cars. Suppose a 
commuter is twice as likely to select the middle car (#3) as to select either adjacent 
car (#2 or #4), and is twice as likely to select either adjacent car as to select either 
end car (#1 or #5). Let $p_i$ = P(car i is selected) = $P(E_i)$. Then we have
$p_3 = 2p_2 = 2p_4$ and $p_2 = 2p_1 = 2p_5 = p_4$. This gives

$$1 = \sum P(E_i) = p_1 + 2p_1 + 4p_1 + 2p_1 + p_1 = 10p_1$$

implying $p_1 = p_5 = .1$, $p_2 = p_4 = .2$, and $p_3 = .4$. The probability that one of the three
middle cars is selected (a compound event) is then $p_2 + p_3 + p_4 = .8$. ■
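The systematic approach of Example 1.15 amounts to normalizing relative weights and then summing over the simple events in a compound event. A minimal sketch, with variable and function names of our own choosing:

```python
from fractions import Fraction

# Relative likelihoods of cars 1..5 implied by the example:
# the middle car is twice as likely as cars 2 and 4, which are twice as likely as the end cars.
weights = {1: 1, 2: 2, 3: 4, 4: 2, 5: 1}

# Normalize so the simple-event probabilities sum to 1.
total = sum(weights.values())
p = {car: Fraction(w, total) for car, w in weights.items()}

def prob_of(event):
    """P(A) = sum of P(E_i) over the simple events E_i in A."""
    return sum(p[car] for car in event)

p_middle_three = prob_of({2, 3, 4})
print(p, p_middle_three)
```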


1.2.4 Equally Likely Outcomes 


In many experiments consisting of N outcomes, it is reasonable to assign equal 
probabilities to all N simple events. These include such obvious examples as tossing 
a fair coin or fair die once (or any fixed number of times), or selecting one or several
cards from a well-shuffled deck of 52. With $p = P(E_i)$ for every i,

$$1 = \sum_{i=1}^{N} P(E_i) = \sum_{i=1}^{N} p = p \cdot N \quad \text{so} \quad p = \frac{1}{N}$$


That is, if there are N possible outcomes, then the probability assigned to each is 
1/N. 

Now consider an event A, with N(A) denoting the number of outcomes contained
in A. Then

$$P(A) = \sum_{E_i \text{ in } A} P(E_i) = \sum_{E_i \text{ in } A} \frac{1}{N} = \frac{N(A)}{N}$$

Once we have counted the number N of outcomes in the sample space, to 
compute the probability of any event we must count the number of outcomes 
contained in that event and take the ratio of the two numbers. Thus when outcomes 
are equally likely, computing probabilities reduces to counting. 


Example 1.16 When two dice are rolled separately, there are N=36 outcomes 
(delete the first row and column from the table in Example 1.3). If both the dice are 
fair, all 36 outcomes are equally likely, so $P(E_i) = 1/36$. Then the event A = {sum of
two numbers is 8} consists of the five outcomes (2, 6), (3, 5), (4, 4), (5, 3), and
(6, 2), so

$$P(A) = \frac{N(A)}{N} = \frac{5}{36}$$ ■
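With equally likely outcomes, computing a probability reduces to counting, which is easy to do by enumeration. The sketch below lists the 36 outcomes for two fair dice and counts those whose sum is 8; the variable names are our own.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die); equally likely when both dice are fair.
sample_space = list(product(range(1, 7), repeat=2))

# Event A: the sum of the two numbers is 8.
A = [outcome for outcome in sample_space if sum(outcome) == 8]

# P(A) = N(A) / N for equally likely outcomes.
p_A = Fraction(len(A), len(sample_space))

print(A, p_A)
```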


The next section of this book investigates counting methods in depth. 


1.2.5 Exercises: Section 1.2 (13-30) 


13. A mutual fund company offers its customers several different funds: a money- 
market fund, three different bond funds (short, intermediate, and long-term), 
two stock funds (moderate and high-risk), and a balanced fund. Among 
customers who own shares in just one fund, the percentages of customers in 
the different funds are as follows: 


Money-market         20%     High-risk stock         18%
Short bond           15%     Moderate-risk stock     25%
Intermediate bond    10%     Balanced                 7%
Long bond             5%


A customer who owns shares in just one fund is randomly selected.

(a) What is the probability that the selected individual owns shares in the 
balanced fund? 

(b) What is the probability that the individual owns shares in a bond fund? 

(c) What is the probability that the selected individual does not own shares in 
a stock fund? 

14. Consider randomly selecting a student at a certain university, and let A denote
the event that the selected individual has a Visa credit card and B be the
analogous event for a MasterCard. Suppose that P(A) = .5, P(B) = .4, and
$P(A \cap B) = .25$.
(a) Compute the probability that the selected individual has at least one of the
two types of cards (i.e., the probability of the event $A \cup B$).
(b) What is the probability that the selected individual has neither type of
card?
(c) Describe, in terms of A and B, the event that the selected student has a
Visa card but not a MasterCard, and then calculate the probability of this
event.

15. A computer consulting firm presently has bids out on three projects.
Let $A_i$ = {awarded project i}, for i = 1, 2, 3, and suppose that $P(A_1) = .22$,
$P(A_2) = .25$, $P(A_3) = .28$, $P(A_1 \cap A_2) = .11$, $P(A_1 \cap A_3) = .05$, $P(A_2 \cap A_3) = .07$,
$P(A_1 \cap A_2 \cap A_3) = .01$. Express in words each of the following events, and
compute the probability of each event:
(a) $A_1 \cup A_2$
(b) $A_1' \cap A_2'$ [Hint: Use De Morgan’s Law.]
(c) $A_1 \cup A_2 \cup A_3$
(d) $A_1' \cap A_2' \cap A_3'$
(e) $A_1' \cap A_2' \cap A_3$
(f) $(A_1' \cap A_2') \cup A_3$

16. Suppose that 55% of all adults regularly consume coffee, 45% regularly
consume soda, and 70% regularly consume at least one of these two products.
(a) What is the probability that a randomly selected adult regularly consumes
both coffee and soda?
(b) What is the probability that a randomly selected adult doesn’t regularly
consume either of these two products?

17. Consider the type of clothes dryer (gas or electric) purchased by each of five
different customers at a certain store.
(a) If the probability that at most one of these customers purchases an electric
dryer is .428, what is the probability that at least two purchase an electric
dryer?
(b) If P(all five purchase gas) = .116 and P(all five purchase electric) = .005,
what is the probability that at least one of each type is purchased?

18. An individual is presented with three different glasses of cola, labeled C, D,
and P. He is asked to taste all three and then list them in order of preference.
Suppose the same cola has actually been put into all three glasses.
(a) What are the simple events in this ranking experiment, and what
probability would you assign to each one?
(b) What is the probability that C is ranked first?
(c) What is the probability that C is ranked first and D is ranked last?

19. Let A denote the event that the next request for assistance from a statistical 
software consultant relates to the SPSS package, and let B be the event that the 
next request is for help with SAS. Suppose that P(A) = .30 and P(B) = .50. 
(a) Why is it not the case that P(A) + P(B) = 1? 

(b) Calculate P(A’). 
(c) Calculate $P(A \cup B)$.
(d) Calculate $P(A' \cap B')$.

20. A box contains six 40-W bulbs, five 60-W bulbs, and four 75-W bulbs. If 
bulbs are selected one by one in random order, what is the probability that at 
least two bulbs must be selected to obtain one that is rated 75 W? 

21. Human visual inspection of solder joints on printed circuit boards can be very 
subjective. Part of the problem stems from the numerous types of solder 
defects (e.g., pad nonwetting, knee visibility, voids) and even the degree to 
which a joint possesses one or more of these defects. Consequently, even 
highly trained inspectors can disagree on the disposition of a particular joint. 
In one batch of 10,000 joints, inspector A found 724 that were judged 
defective, inspector B found 751 such joints, and 1159 of the joints were 
judged defective by at least one of the inspectors. Suppose that one of the 
10,000 joints is randomly selected. 

(a) What is the probability that the selected joint was judged to be defective 
by neither of the two inspectors? 

(b) What is the probability that the selected joint was judged to be defective 
by inspector B but not by inspector A? 

22. A factory operates three different shifts. Over the last year, 200 accidents have 
occurred at the factory. Some of these can be attributed at least in part to unsafe 
working conditions, whereas the others are unrelated to working conditions. 
The accompanying table gives the percentage of accidents falling in each type 
of accident—shift category. 


Shift     Unsafe conditions     Unrelated to conditions
Day              10%                     35%
Swing             8%                     20%
Night             5%                     22%


Suppose one of the 200 accident reports is randomly selected from a file of 
reports, and the shift and type of accident are determined. 
(a) What are the simple events? 
(b) What is the probability that the selected accident was attributed to unsafe 
conditions? 
(c) What is the probability that the selected accident did not occur on the day 
shift? 
23. An insurance company offers four different deductible levels (none, low,
medium, and high) for its homeowner’s policyholders and three different
levels (low, medium, and high) for its automobile policyholders. The
accompanying table gives proportions for the various categories of 
policyholders who have both types of insurance. For example, the proportion 
of individuals with both low homeowner’s deductible and low auto deductible 
is .06 (6% of all such individuals). 


                Homeowner’s
Auto       N       L       M       H
L         .04     .06     .05     .03
M         .07     .10     .20     .10
H         .02     .03     .15     .15


Suppose an individual having both types of policies is randomly selected. 

(a) What is the probability that the individual has a medium auto deductible 
and a high homeowner’s deductible? 

(b) What is the probability that the individual has a low auto deductible? A low 
homeowner’s deductible? 

(c) What is the probability that the individual is in the same category for both 
auto and homeowner’s deductibles? 

(d) Based on your answer in part (c), what is the probability that the two 
categories are different? 

(e) What is the probability that the individual has at least one low deductible 
level? 

(f) Using the answer in part (e), what is the probability that neither deductible 
level is low? 

24. The route used by a driver in commuting to work contains two intersections
with traffic signals. The probability that he must stop at the first signal is .4, the
analogous probability for the second signal is .5, and the probability that he
must stop at one or more of the two signals is .6. What is the probability that he
must stop

(a) At both signals? 

(b) At the first signal but not at the second one? 

(c) At exactly one signal? 

25. The computers of six faculty members in a certain department are to be
replaced. Two of the faculty members have selected laptop machines and the
other four have chosen desktop machines. Suppose that only two of the setups
can be done on a particular day, and the two computers to be set up are
randomly selected from the six (implying 15 equally likely outcomes; if the
computers are numbered 1, 2, ... , 6, then one outcome consists of computers
1 and 2, another consists of computers 1 and 3, and so on).
(a) What is the probability that both selected setups are for laptop computers?
(b) What is the probability that both selected setups are desktop machines? 
(c) What is the probability that at least one selected setup is for a desktop 
computer? 
(d) What is the probability that at least one computer of each type is chosen for 
setup? 
26. Show that if one event A is contained in another event B (i.e., A is a subset of B),
then $P(A) \le P(B)$. [Hint: For such A and B, $A$ and $B \cap A'$ are disjoint and
$B = A \cup (B \cap A')$, as can be seen from a Venn diagram.] For general A and B,
what does this imply about the relationship among $P(A \cap B)$, $P(A)$, and
$P(A \cup B)$?
27. The three most popular options on a certain type of new car are a built-in GPS
(A), a sunroof (B), and an automatic transmission (C). If 40% of all purchasers 
request A, 55% request B, 70% request C, 63% request A or B, 77% request A or 
C, 80% request B or C, and 85% request A or B or C, compute the probabilities 
of the following events. 
(a) The next purchaser will request at least one of the three options. 
(b) The next purchaser will select none of the three options. 
(c) The next purchaser will request only an automatic transmission and neither 
of the other two options. 
(d) The next purchaser will select exactly one of these three options. 
[Hint: “A or B” is the event that at least one of the two options is requested; 
try drawing a Venn diagram and labeling all regions.] 
28. A certain system can experience three different types of defects. Let $A_i$ ($i = 1,
2, 3$) denote the event that the system has a defect of type $i$. Suppose that

\[
\begin{aligned}
P(A_1) &= .12 & P(A_2) &= .07 & P(A_3) &= .05 \\
P(A_1 \cup A_2) &= .13 & P(A_1 \cup A_3) &= .14 & P(A_2 \cup A_3) &= .10 \\
P(A_1 \cap A_2 \cap A_3) &= .01
\end{aligned}
\]


(a) What is the probability that the system does not have a type 1 defect?

(b) What is the probability that the system has both type 1 and type 2 defects?

(c) What is the probability that the system has both type 1 and type 2 defects
but not a type 3 defect?

(d) What is the probability that the system has at most two of these defects? 

29. In Exercise 7, suppose that any incoming individual is equally likely to be
assigned to any of the three stations irrespective of where other individuals
have been assigned. What is the probability that

(a) All three family members are assigned to the same station? 

(b) At most two family members are assigned to the same station? 

(c) Every family member is assigned to a different station? 

30. Apply the proposition involving the probability of $A \cup B$ to the union of the two
events $(A \cup B)$ and $C$ in order to verify the result for $P(A \cup B \cup C)$.

1.3 Counting Methods


When the various outcomes of an experiment are equally likely (the same proba- 
bility is assigned to each simple event), the task of computing probabilities reduces 
to counting. Equally likely outcomes arise in many games, including the six sides of 
a fair die, the two sides of a fair coin, and the 38 slots of a fair roulette wheel. As 
mentioned at the end of the last section, if N is the number of outcomes in a sample 
space and N(A) is the number of outcomes contained in an event A, then 


\[ P(A) = \frac{N(A)}{N} \tag{1.1} \]


If a list of the outcomes is available or easy to construct and N is small, then the 
numerator and denominator of Eq. (1.1) can be obtained without the benefit of any 
general counting principles. 

There are, however, many experiments for which the effort involved in 
constructing such a list is prohibitive because N is quite large. By exploiting 
some general counting rules, it is possible to compute probabilities of the form 
(1.1) without a listing of outcomes. These rules are also useful in many problems 
involving outcomes that are not equally likely. Several of the rules developed here 
will be used in studying probability distributions in the next chapter. 


1.3.1 The Fundamental Counting Principle 


Our first counting rule applies to any situation in which an event consists of ordered 
pairs of objects and we wish to count the number of such pairs. By an ordered pair, 
we mean that, if $O_1$ and $O_2$ are objects, then the pair $(O_1, O_2)$ is different from the
pair $(O_2, O_1)$. For example, if an individual selects one airline for a trip from Los
Angeles to Chicago and a second one for continuing on to New York, one 
possibility is (American, United), another is (United, American), and still another 
is (United, United). 


PROPOSITION 

If the first element or object of an ordered pair can be selected in $n_1$ ways, and
for each of these $n_1$ ways the second element of the pair can be selected in $n_2$
ways, then the number of pairs is $n_1 n_2$.


Example 1.17 A homeowner doing some remodeling requires the services of both 
a plumbing contractor and an electrical contractor. If there are 12 plumbing 
contractors and 9 electrical contractors available in the area, in how many ways 
can the contractors be chosen? If we denote the plumbers by $P_1, \ldots, P_{12}$ and the
electricians by $Q_1, \ldots, Q_9$, then we wish the number of pairs of the form $(P_i, Q_j)$.
With $n_1 = 12$ and $n_2 = 9$, the proposition yields $N = (12)(9) = 108$ possible ways of
choosing the two types of contractors. ■


In Example 1.17, the choice of the second element of the pair did not depend on 
which first element was chosen or occurred. As long as there is the same number of 
choices of the second element for each first element, the proposition above is valid 
even when the set of possible second elements depends on the first element. 


Example 1.18 A family has just moved to a new city and requires the services of 
both an obstetrician and a pediatrician. There are two easily accessible medical 
clinics, each having two obstetricians and three pediatricians. The family will obtain 
maximum health insurance benefits by joining a clinic and selecting both doctors 
from that clinic. In how many ways can this be done? Denote the obstetricians by $O_1$,
$O_2$, $O_3$, and $O_4$ and the pediatricians by $P_1, \ldots, P_6$. Then we wish the number of pairs
$(O_i, P_j)$ for which $O_i$ and $P_j$ are associated with the same clinic. Because there are four
obstetricians, $n_1 = 4$, and for each there are three choices of pediatrician, so $n_2 = 3$.
Applying the proposition gives $N = n_1 n_2 = 12$ possible choices. ■


If a six-sided die is tossed five times in succession, then each possible outcome is
an ordered collection of five numbers such as (4, 1, 4, 3, 6) or (6, 2, 5, 5, 1). We
will call an ordered collection of k objects a k-tuple (so a pair is a 2-tuple and a 
triple is a 3-tuple). Each outcome of the die-tossing experiment is then a 5-tuple. 
The following theorem, called the Fundamental Counting Principle, generalizes the 
previous proposition to k-tuples. 


FUNDAMENTAL COUNTING PRINCIPLE 

Suppose a set consists of ordered collections of k elements (k-tuples) and that
there are $n_1$ possible choices for the first element; for each choice of the first
element, there are $n_2$ possible choices of the second element; ...; for each
possible choice of the first $k - 1$ elements, there are $n_k$ choices of the kth
element. Then there are $n_1 n_2 \cdots n_k$ possible k-tuples.


Example 1.19 (Example 1.17 continued) Suppose the home remodeling job 
involves first purchasing several kitchen appliances. They will all be purchased 
from the same dealer, and there are five dealers in the area. With the dealers denoted
by $D_1, \ldots, D_5$, there are $N = n_1 n_2 n_3 = (5)(12)(9) = 540$ 3-tuples of the form $(D_i, P_j,
Q_k)$, so there are 540 ways to choose first an appliance dealer, then a plumbing
contractor, and finally an electrical contractor. ■
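
The counts in Examples 1.17 and 1.19 can also be checked by brute-force enumeration. The following is only an illustrative sketch: the labels P1, ..., P12, Q1, ..., Q9, and D1, ..., D5 mirror the notation of the examples, and the variable names are our own.

```python
from itertools import product

# Labels follow the notation of Examples 1.17 and 1.19.
plumbers = [f"P{i}" for i in range(1, 13)]      # 12 plumbing contractors
electricians = [f"Q{j}" for j in range(1, 10)]  # 9 electrical contractors
dealers = [f"D{i}" for i in range(1, 6)]        # 5 appliance dealers

# Every ordered pair (plumber, electrician): n1 * n2 of them.
pairs = list(product(plumbers, electricians))

# Every 3-tuple (dealer, plumber, electrician): n1 * n2 * n3 of them.
triples = list(product(dealers, plumbers, electricians))

print(len(pairs), len(triples))
```

Listing the outcomes is feasible here only because the counts are small; the point of the counting principle is that the product can be computed without any such list.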


Example 1.20 (Example 1.18 continued) If each clinic also has three specialists in
internal medicine and two general surgeons, there are $n_1 n_2 n_3 n_4 = (4)(3)(3)(2) = 72$
ways to select one doctor of each type such that all doctors practice at the same
clinic. ■
Fig. 1.7 Tree diagram for Example 1.18


1.3.2 Tree Diagrams


In many counting and probability problems, a configuration called a tree diagram 
can be used to represent pictorially all the possibilities. The tree diagram associated 
with Example 1.18 appears in Fig. 1.7. Starting from a point on the left side of the 
diagram, for each possible first element of a pair a straight-line segment emanates 
rightward. Each of these lines is referred to as a first-generation branch. Now for 
any given first-generation branch we construct another line segment emanating 
from the tip of the branch for each possible choice of a second element of the pair. 
Each such line segment is a second-generation branch. Because there are four 
obstetricians, there are four first-generation branches, and three pediatricians for 
each obstetrician yields three second-generation branches emanating from each 
first-generation branch. 

Generalizing, suppose there are $n_1$ first-generation branches, and for each first-
generation branch there are $n_2$ second-generation branches. The total number of
second-generation branches is then $n_1 n_2$. Since the end of each second-generation
branch corresponds to exactly one possible pair (choosing a first element and then a
second puts us at the end of exactly one second-generation branch), there are $n_1 n_2$
pairs, verifying our first proposition.

The Fundamental Counting Principle can also be illustrated by a tree diagram; 
simply construct a more elaborate diagram by adding third-generation branches
emanating from the tip of each second-generation branch, then fourth-generation branches,
and so on, until finally kth-generation branches are added. 

The construction of a tree diagram does not depend on having the same number 
of second-generation branches emanating from each first-generation branch. If the 
second clinic had four pediatricians, then there would be only three branches 
emanating from two of the first-generation branches and four emanating from each
of the other two first-generation branches. A tree diagram can thus be used to
represent experiments for which the Fundamental Counting Principle does not
apply.
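
A tree diagram amounts to a nested enumeration, which can be sketched in a few lines. In the sketch below, the assignment of obstetricians O1, O2 to the first clinic and O3, O4 to the second is our assumption (Example 1.18 does not say which doctors belong to which clinic), and the label P7 for the extra pediatrician is hypothetical.

```python
def leaves(clinics):
    """List every (obstetrician, pediatrician) pair from the same clinic.

    Each returned pair corresponds to the tip of one second-generation branch.
    """
    return [(o, p) for obstetricians, pediatricians in clinics
            for o in obstetricians
            for p in pediatricians]

# Example 1.18: two clinics, each with two obstetricians and three pediatricians.
clinic_1 = (["O1", "O2"], ["P1", "P2", "P3"])
clinic_2 = (["O3", "O4"], ["P4", "P5", "P6"])
equal_tree = leaves([clinic_1, clinic_2])

# Variation from the text: the second clinic has four pediatricians, so the
# number of second-generation branches is no longer the same for every
# first-generation branch.
clinic_2_bigger = (["O3", "O4"], ["P4", "P5", "P6", "P7"])
unequal_tree = leaves([clinic_1, clinic_2_bigger])

print(len(equal_tree), len(unequal_tree))
```

In the second case the count is 2(3) + 2(4) rather than a single product, which is why the enumeration still works even though the Fundamental Counting Principle does not apply directly.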


1.3.3 Permutations 


So far the successive elements of a k-tuple were selected from entirely different 
sets (e.g., appliance dealers, then plumbers, and finally electricians). In several 
tosses of a die, the set from which successive elements are chosen is always
{1, 2, 3, 4, 5, 6}, but the choices are made “with replacement” so that the
same element can appear more than once. If the die is rolled once, there are
obviously 6 possible outcomes; for two rolls, there are $6^2 = 36$ possibilities, since
we distinguish, for example, (3, 5) from (5, 3). In general, if k selections are made with
replacement from a set of n distinct objects (such as the six sides of a die), then
the total number of possible outcomes is $n^k$.
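
As a quick check of the $n^k$ rule, the outcomes of several die rolls can be enumerated directly. This is a minimal sketch; the variable names are ours.

```python
from itertools import product

faces = range(1, 7)  # the six sides of a die

# Ordered outcomes of k rolls, i.e., k selections with replacement.
two_rolls = list(product(faces, repeat=2))
five_rolls = list(product(faces, repeat=5))

# Order matters: (3, 5) and (5, 3) are counted as different outcomes.
print(len(two_rolls), len(five_rolls))
```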

We now consider a fixed set consisting of n distinct elements and suppose that a 
k-tuple is formed by selecting successively from this set without replacement, so 
that an element can appear in at most one of the k positions. 


DEFINITION 

Any ordered sequence of k objects taken without replacement from a set of
n distinct objects is called a permutation of size k of the objects. The number
of permutations of size k that can be constructed from the n objects is denoted
by ${}_nP_k$.


The number of permutations of size k is obtained immediately from the Fundamental
Counting Principle. The first element can be chosen in n ways; for each of
these n ways the second element can be chosen in $n - 1$ ways; and so on. Finally,
for each way of choosing the first $k - 1$ elements, the kth element can be chosen in
$n - (k - 1) = n - k + 1$ ways, so

\[ {}_nP_k = n(n-1)(n-2) \cdots (n-k+2)(n-k+1) \]


Example 1.21 Ten teaching assistants are available for grading papers in a partic- 
ular course. The first exam consists of four questions, and the professor wishes to 
select a different assistant to grade each question (only one assistant per question). 
In how many ways can assistants be chosen to grade the exam? Here n = the
number of assistants = 10 and k = the number of questions = 4. The number of
different grading assignments is then ${}_{10}P_4 = (10)(9)(8)(7) = 5040$. ■
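
The count in Example 1.21 can be confirmed both from the product formula and by listing every assignment. The sketch below uses Python's standard `math.perm` (available in Python 3.8 and later); numbering the assistants 0 through 9 is our own convention.

```python
from itertools import permutations
from math import perm

assistants = range(10)  # ten teaching assistants, labeled 0..9 for illustration

# Each 4-tuple assigns a different assistant to questions 1 through 4.
assignments = list(permutations(assistants, 4))

by_formula = 10 * 9 * 8 * 7
print(len(assignments), perm(10, 4), by_formula)
```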
Example 1.22 The Birthday Problem. Disregarding the possibility of a February
29 birthday, suppose a randomly selected individual is equally likely to have been 
born on any one of the other 365 days. If ten people are randomly selected, what is 
the probability that all have different birthdays? 

Imagine selecting 10 days, with replacement, from the calendar to represent the 
birthdays of the ten randomly selected people. One possible outcome of this 
selection would be (March 31, December 30, ..., September 27, February 12). 
There are $365^{10}$ such outcomes. The number of outcomes among them with no
repeated birthdays is

\[ (365)(364) \cdots (356) = {}_{365}P_{10} \]


(any of the 365 calendar days may be selected first; if March 31 is chosen, any of the 
other 364 days is acceptable for the second selection; and so on). Hence, the 
probability that all ten randomly selected people have different birthdays equals
${}_{365}P_{10}/365^{10} = .883$. Equivalently, there’s only a .117 chance that at least two
people out of these ten will share a birthday. It’s worth noting that the first
probability can be rewritten as

\[ \frac{{}_{365}P_{10}}{365^{10}} = \frac{365}{365} \cdot \frac{364}{365} \cdots \frac{356}{365} \]


We may think of each fraction as representing the chance the next birthday 
selected will be different from all previous ones. (This is an example of conditional 
probability, the topic of the next section.) 

Now replace 10 with k (i.e., k randomly selected birthdays); what is the smallest 
k for which there is at least a 50-50 chance that two or more people will have the 
same birthday? Most people incorrectly guess that we need a very large group of 
people for this to be true; the most common guess is that 183 people are required 
(half the days on the calendar). But the required value of k is actually much smaller: 
the probability that k randomly selected people all have different birthdays equals
${}_{365}P_k/365^k$, which not surprisingly decreases as k increases. Figure 1.8 displays this
probability for increasing values of k. As it turns out, the smallest k for which this 
probability falls below .5 is just k= 23. That is, there is less than a 50-50 chance 
(.4927, to be precise) of 23 randomly selected people all having different birthdays, 
and thus a probability .5073 that at least two people in a random sample of 23 will 
share a birthday. 
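
The search for the smallest such k is easy to carry out by computer. The following sketch evaluates ${}_{365}P_k/365^k$ under the example's assumption of 365 equally likely birthdays; the function name `p_all_different` is ours.

```python
from math import perm

def p_all_different(k, days=365):
    """P(k randomly selected people all have different birthdays)."""
    return perm(days, k) / days ** k

# Smallest group size for which a shared birthday is more likely than not.
smallest_k = next(k for k in range(1, 366) if p_all_different(k) < 0.5)

print(round(p_all_different(10), 3))          # ten people, as in the example
print(smallest_k, round(p_all_different(smallest_k), 4))
```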
[Figure: P(no shared birthdays among k people) plotted against k]

Fig. 1.8 P(no birthday match) in Example 1.22 ■


The expression for ${}_nP_k$ can be rewritten with the aid of factorial notation. Recall
that 7! (read “7 factorial”) is compact notation for the descending product of
integers (7)(6)(5)(4)(3)(2)(1). More generally, for any positive integer m, $m! =
m(m-1)(m-2) \cdots (2)(1)$. This gives 1! = 1, and we also define 0! = 1.

Using factorial notation, (10)(9)(8)(7) = (10)(9)(8)(7)(6!)/6! = 10!/6!. More 
generally, 


\[
{}_nP_k = n(n-1) \cdots (n-k+1)
= \frac{n(n-1) \cdots (n-k+1)(n-k)(n-k-1) \cdots (2)(1)}{(n-k)(n-k-1) \cdots (2)(1)}
\]

which becomes

\[ {}_nP_k = \frac{n!}{(n-k)!} \]

For example, ${}_9P_3 = 9!/(9-3)! = 9!/6! = 9 \cdot 8 \cdot 7 \cdot 6!/6! = 9 \cdot 8 \cdot 7$. Note also that
because $0! = 1$, ${}_nP_n = n!/(n-n)! = n!/0! = n!/1 = n!$, as it should.


1.3.4 Combinations 


Often the objective is to count the number of unordered subsets of size k that can be 
formed from a set consisting of n distinct objects. For example, in bridge it is only 
the 13 cards in a hand and not the order in which they are dealt that is important; in 
the formation of a committee, the order in which committee members are listed is 
frequently unimportant. 
DEFINITION

Given a set of n distinct objects, any unordered subset of size k of the objects
is called a combination. The number of combinations of size k that can be
formed from n distinct objects will be denoted by $\binom{n}{k}$ or ${}_nC_k$.


The number of combinations of size k from a particular set is smaller than the 
number of permutations because, when order is disregarded, some of the 
permutations correspond to the same combination. Consider, for example, the set 
{A, B, C, D, E} consisting of five elements. There are ${}_5P_3 = 5!/(5-3)! = 60$
permutations of size 3. There are six permutations of size 3 consisting of the
elements A, B, and C because these three can be ordered $3 \cdot 2 \cdot 1 = 3! = 6$ ways:
(A, B, C), (A, C, B), (B, A, C), (B, C, A), (C, A, B), and (C, B, A). These six
permutations are equivalent to the single combination {A, B, C}. Similarly, for any
other combination of size 3, there are 3! permutations, each obtained by ordering
the three objects. Thus,

\[ 60 = {}_5P_3 = \binom{5}{3} \cdot 3! \quad \text{so} \quad \binom{5}{3} = \frac{60}{3!} = 10 \]


These ten combinations are 
{A, B, C}, {A, B, D}, {A, B, E}, {A, C, D}, {A, C, E},
{A, D, E}, {B, C, D}, {B, C, E}, {B, D, E}, {C, D, E}

When there are n distinct objects, any permutation of size k is obtained by
ordering the k unordered objects of a combination in one of k! ways, so the number
of permutations is the product of k! and the number of combinations. This gives

\[ {}_nC_k = \binom{n}{k} = \frac{{}_nP_k}{k!} = \frac{n!}{k!\,(n-k)!} \]


Notice that $\binom{n}{n} = 1$ and $\binom{n}{0} = 1$ because there is only one way to choose a set
of (all) n elements or of no elements, and $\binom{n}{1} = n$ since there are n subsets of size 1.
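
The relationship between permutations and combinations for the set {A, B, C, D, E} can be verified by enumeration. This is a small sketch using Python's standard `itertools` and `math` modules; the variable names are ours.

```python
from itertools import combinations, permutations
from math import comb, factorial, perm

letters = "ABCDE"

perms = list(permutations(letters, 3))   # ordered selections of size 3
combs = list(combinations(letters, 3))   # unordered subsets of size 3

# Group the permutations by the set of letters they use: each combination
# should account for exactly 3! of the permutations.
orderings_per_subset = {
    subset: sum(1 for p in perms if set(p) == set(subset)) for subset in combs
}

print(len(perms), len(combs), set(orderings_per_subset.values()))
```

The same identity holds in general, which is what `comb(n, k) == perm(n, k) // factorial(k)` expresses for any valid n and k.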


Example 1.23 A bridge hand consists of any 13 cards selected from a 52-card deck 
without regard to order. There are $\binom{52}{13} = 52!/(13! \cdot 39!)$ different bridge hands,
which works out to approximately 635 billion. Since there are 13 cards in each suit,
the number of hands consisting entirely of clubs and/or spades (no red cards) is
$\binom{26}{13} = 26!/(13! \cdot 13!) = 10{,}400{,}600$. One of these $\binom{26}{13}$ hands consists
entirely of spades, and one consists entirely of clubs, so there are $\binom{26}{13} - 2$
hands that consist entirely of clubs and spades with both suits represented in the
hand. Suppose a bridge hand is dealt from a well-shuffled deck (i.e., 13 cards are 
randomly selected from among the 52 possibilities) and let 
A = {the hand consists entirely of spades and clubs with both suits represented}
B = {the hand consists of exactly two suits}

The $N = \binom{52}{13}$ possible outcomes are equally likely, so

\[ P(A) = \frac{N(A)}{N} = \frac{\binom{26}{13} - 2}{\binom{52}{13}} = .0000164 \]

Since there are $\binom{4}{2} = 6$ combinations consisting of two suits, of which spades
and clubs is one such combination,

\[ P(B) = \frac{6\left[\binom{26}{13} - 2\right]}{\binom{52}{13}} = .0000983 \]
That is, a hand consisting entirely of cards from exactly two of the four suits will
occur roughly once in every 10,000 hands. If you play bridge only once a month, it
is likely that you will never be dealt such a hand. ■
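
The arithmetic in Example 1.23 involves numbers too large for comfortable hand calculation. A short sketch using Python's standard `math.comb` (the variable names are ours) reproduces the counts and probabilities:

```python
from math import comb

N = comb(52, 13)             # all possible bridge hands
black_hands = comb(26, 13)   # hands drawn only from clubs and spades

# Remove the all-spades hand and the all-clubs hand.
n_A = black_hands - 2
p_A = n_A / N

# Any of the C(4, 2) = 6 pairs of suits can play the role of spades and clubs.
p_B = comb(4, 2) * n_A / N

print(N, black_hands)
print(f"{p_A:.7f} {p_B:.7f}")
```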


Example 1.24 A university warehouse has received a shipment of 25 printers, of 
which 10 are laser printers and 15 are inkjet models. If 6 of these 25 are selected at 
random to be checked by a particular technician, what is the probability that exactly 
3 of those selected are laser printers (so that the other 3 are inkjets)? 

Let $D_3$ = {exactly 3 of the 6 selected are inkjet printers}. Assuming that any
particular set of 6 printers is as likely to be chosen as is any other set of 6, we have
equally likely outcomes, so $P(D_3) = N(D_3)/N$, where N is the number of ways of
choosing 6 printers from the 25 and $N(D_3)$ is the number of ways of choosing 3 laser
printers and 3 inkjet models. Thus $N = \binom{25}{6}$. To obtain $N(D_3)$, think of first
choosing 3 of the 15 inkjet models and then 3 of the 10 laser printers. There are $\binom{15}{3}$
ways of choosing the 3 inkjet models, and there are $\binom{10}{3}$ ways of choosing the


3 laser printers; by the Fundamental Counting Principle, N(D3) is the product of 
these two numbers. So 
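\[
P(D_3) = \frac{N(D_3)}{N}
= \frac{\binom{15}{3}\binom{10}{3}}{\binom{25}{6}}
= \frac{\dfrac{15!}{3!\,12!} \cdot \dfrac{10!}{3!\,7!}}{\dfrac{25!}{6!\,19!}} = .3083
\]

Let $D_4$ = {exactly 4 of the 6 printers selected are inkjet models} and define $D_5$
and $D_6$ in an analogous manner. Notice that the events $D_3$, $D_4$, $D_5$, and $D_6$ are
disjoint. Thus, the probability that at least 3 inkjet printers are selected is

\[
\begin{aligned}
P(D_3 \cup D_4 \cup D_5 \cup D_6) &= P(D_3) + P(D_4) + P(D_5) + P(D_6) \\
&= \frac{\binom{15}{3}\binom{10}{3}}{\binom{25}{6}}
 + \frac{\binom{15}{4}\binom{10}{2}}{\binom{25}{6}}
 + \frac{\binom{15}{5}\binom{10}{1}}{\binom{25}{6}}
 + \frac{\binom{15}{6}\binom{10}{0}}{\binom{25}{6}} = .8530
\end{aligned}
\]

■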
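
The probabilities in Example 1.24 all have the same form, so they can be computed with one small helper. The following is a sketch under the example's assumptions (15 inkjet and 10 laser printers, 6 selected at random); the function name `p_inkjet` is ours.

```python
from math import comb

def p_inkjet(x, n_inkjet=15, n_laser=10, sample=6):
    """P(exactly x of the selected printers are inkjet models)."""
    ways = comb(n_inkjet, x) * comb(n_laser, sample - x)
    return ways / comb(n_inkjet + n_laser, sample)

p_exactly_3 = p_inkjet(3)
# D3, D4, D5, D6 are disjoint, so their probabilities add.
p_at_least_3 = sum(p_inkjet(x) for x in range(3, 7))

print(round(p_exactly_3, 4), round(p_at_least_3, 4))
```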


Example 1.25 The article “Does Your iPod Really Play Favorites?” (The Amer. 
Statistician, 2009: 263-268) investigated the randomness of the iPod’s shuffling 
process. One professor’s iPod playlist contains 100 songs, of which 10 are by the 
Beatles. Suppose the shuffle feature is used to play the songs in random order. What 
is the probability that the first Beatles song heard is the fifth song played? 

In order for this event to occur, it must be the case that the first four songs played 
are not Beatles songs (NBs) and that the fifth song is by the Beatles (B). The total 
number of ways to select the first five songs is (100)(99)(98)(97)(96), while the 
number of ways to select these five songs so that the first four are NBs and the next 
is a B is (90)(89)(88)(87)(10). The random shuffle assumption implies that every 
sequence of 5 songs from amongst the 100 has the same chance of being selected as 
the first 5 played, i.e., each outcome (a list of 5 songs) is equally likely. Therefore 
the desired probability is 


\[
P(\text{1st B is the 5th song played})
= \frac{90 \cdot 89 \cdot 88 \cdot 87 \cdot 10}{100 \cdot 99 \cdot 98 \cdot 97 \cdot 96}
= \frac{{}_{90}P_4 \cdot 10}{{}_{100}P_5} = .0679
\]


Here is an alternative line of reasoning involving combinations. Rather than 
focusing on selecting just the first 5 songs, think of playing all 100 songs in random 
order. The number of ways of choosing 10 of these songs to be the Bs (without
regard to the order in which they are played) is $\binom{100}{10}$. Now if we choose 9 of the
last 95 songs to be Bs, which can be done in $\binom{95}{9}$ ways, that leaves four NBs and


one B for the first five songs. Finally, there is only one way for these first five songs 
to start with four NBs and then follow with a B (remember that we are considering 
unordered subsets). Thus 
\[ P(\text{1st B is the 5th song played}) = \frac{\binom{95}{9}}{\binom{100}{10}} \]
It is easily verified that this latter expression is in fact identical to the previous 
expression for the desired probability, so the numerical result is again .0679. 


By similar reasoning, the probability that one of the first five songs played is a 
Beatles song is 


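\[
\begin{aligned}
&P(\text{1st B is the 1st or 2nd or 3rd or 4th or 5th song played}) \\
&\quad = \frac{\binom{99}{9}}{\binom{100}{10}} + \frac{\binom{98}{9}}{\binom{100}{10}}
 + \frac{\binom{97}{9}}{\binom{100}{10}} + \frac{\binom{96}{9}}{\binom{100}{10}}
 + \frac{\binom{95}{9}}{\binom{100}{10}} = .4162
\end{aligned}
\]

It is thus rather likely that a Beatles song will be one of the first five
songs played. Such a “coincidence” is not as surprising as might first appear to be
the case. ■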
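
The claim in Example 1.25 that the permutation and combination expressions agree can be checked with exact rational arithmetic. In the sketch below the variable names are ours, and the final comparison against the complement of "no Beatles song among the first five" is an extra sanity check that does not appear in the example.

```python
from fractions import Fraction
from math import comb, perm

# Ordered argument: four non-Beatles songs, then a Beatles song.
ordered = Fraction(perm(90, 4) * 10, perm(100, 5))

# Unordered argument: 9 of the last 95 positions hold the other Beatles songs.
unordered = Fraction(comb(95, 9), comb(100, 10))

# First Beatles song is at position j = 1, ..., 5.
first_five = sum(Fraction(comb(100 - j, 9), comb(100, 10)) for j in range(1, 6))

# Sanity check (our addition): 1 - P(first five songs are all non-Beatles).
complement = 1 - Fraction(comb(90, 5), comb(100, 5))

print(ordered == unordered, round(float(ordered), 4))
print(first_five == complement, round(float(first_five), 4))
```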


1.3.5 Exercises: Section 1.3 (31-49) 


31. An ATM personal identification number (PIN) consists of a four-digit 
sequence. 

(a) How many different possible PINs are there if there are no restrictions on
the possible choice of digits?
(b) According to a representative at the authors’ local branch of Chase Bank,
there are in fact restrictions on the choice of digits. The following choices
are prohibited: (1) all four digits identical; (2) sequences of consecutive
ascending or descending digits, such as 6543; (3) any sequence starting
with 19 (birth years are too easy to guess). So if one of the PINs in (a) is
randomly selected, what is the probability that it will be a legitimate PIN
(i.e., not be one of the prohibited sequences)?
(c) Someone has stolen an ATM card and knows the first and last digits of the
PIN are 8 and 1, respectively. He also knows about the restrictions
described in (b). If he gets three chances to guess the middle two digits
before the ATM “eats” the card, what is the probability the thief gains
access to the account?
(d) Recalculate the probability in (c) if the first and last digits are 1 and 1.
32. The College of Science Council has one student representative from each of
the five science departments (biology, chemistry, statistics, mathematics,
physics). In how many ways can
(a) Both a council president and a vice president be selected?
(b) A president, a vice president, and a secretary be selected?
(c) Two council members be selected for the Dean’s Council?

33. A friend of ours is giving a dinner party. Her current wine supply includes
8 bottles of zinfandel, 10 of merlot, and 12 of cabernet (she drinks only red
wine), all from different wineries.

(a) If she wants to serve 3 bottles of zinfandel and serving order is important, 
how many ways are there to do this? 

(b) If 6 bottles of wine are to be randomly selected from the 30 for serving, 
how many ways are there to do this? 

(c) If 6 bottles are randomly selected, how many ways are there to obtain two 
bottles of each variety? 

(d) If 6 bottles are randomly selected, what is the probability that this results 
in two bottles of each variety being chosen? 

(e) If 6 bottles are randomly selected, what is the probability that all of them 
are the same variety? 

34. (a) Beethoven wrote 9 symphonies and Mozart wrote 27 piano concertos. If a
university radio station announcer wishes to play first a Beethoven sym- 
phony and then a Mozart concerto, in how many ways can this be done? 

(b) The station manager decides that on each successive night (7 days per 
week), a Beethoven symphony will be played, followed by a Mozart piano 
concerto, followed by a Schubert string quartet (of which there are 15). 
For roughly how many years could this policy be continued before exactly 
the same program would have to be repeated? 

35. A stereo store is offering a special price on a complete set of components
(receiver, compact disc player, speakers, turntable). A purchaser is offered a
choice of manufacturer for each component:

Receiver:             Kenwood, Onkyo, Pioneer, Sony, Yamaha
Compact disc player:  Onkyo, Pioneer, Sony, Panasonic
Speakers:             Boston, Infinity, Polk
Turntable:            Onkyo, Sony, Teac, Technics


A switchboard display in the store allows a customer to hook together any 
selection of components (consisting of one of each type). Use the Fundamen- 
tal Counting Principle to answer the following questions: 

(a) In how many ways can one component of each type be selected? 

(b) In how many ways can components be selected if both the receiver and the 
compact disc player are to be Sony? 

(c) In how many ways can components be selected if none is to be Sony? 

(d) In how many ways can a selection be made if at least one Sony component 
is to be included? 

(e) If someone flips switches on the selection in a completely random fashion, 
what is the probability that the system selected contains at least one Sony 
component? Exactly one Sony component? 
36. In five-card poker, a straight consists of five cards in adjacent ranks (e.g., 9 of
clubs, 10 of hearts, jack of hearts, queen of spades, king of clubs). Assuming
that aces can be high or low, if you are dealt a five-card hand, what is the
probability that it will be a straight with high card 10? What is the probability
that it will be a straight? What is the probability that it will be a straight flush
(all cards in the same suit)?

37. A local bar stocks 12 American beers, 8 Mexican beers, and 9 German beers.
You ask the bartender to pick out a five-beer “sampler” for you. Assume the
bartender makes the five selections at random and without replacement.

(a) What is the probability you get at least four American beers? 

(b) What is the probability you get five beers from the same country? 

38. Computer keyboard failures can be attributed to electrical defects or mechanical
defects. A repair facility currently has 25 failed keyboards, 6 of which
have electrical defects and 19 of which have mechanical defects.

(a) How many ways are there to randomly select 5 of these keyboards for a 
thorough inspection (without regard to order)? 

(b) In how many ways can a sample of 5 keyboards be selected so that exactly 
2 have an electrical defect? 

(c) If a sample of 5 keyboards is randomly selected, what is the probability 
that at least 4 of these will have a mechanical defect? 

39. The statistics department at the authors’ university participates in an annual
volleyball tournament. Suppose that all 16 department members are willing to
play.

(a) How many different six-person volleyball rosters could be generated? 
(That is, how many years could the department participate in the tourna- 
ment without repeating the same six-person team?) 

(b) The statistics department faculty consist of 5 women and 11 men. How 
many rosters comprising exactly 2 women and 4 men be generated? 

(c) The tournament’s rules actually require that each team include at least 
two women. Under this rule, how many valid teams could be generated? 

(d) Suppose this year the department decides to randomly select its six 
players. What is the probability the randomly selected team has exactly 
two women? At least two women? 

A production facility employs 20 workers on the day shift, 15 workers on the 

swing shift, and 10 workers on the graveyard shift. A quality control consul- 

tant is to select 6 of these workers for in-depth interviews. Suppose the 
selection is made in such a way that any particular group of 6 workers has 
the same chance of being selected as does any other group (drawing 6 slips 

without replacement from among 45). 

(a) How many selections result in all 6 workers coming from the day shift? 
What is the probability that all 6 selected workers will be from the day 
shift? 

(b) What is the probability that all 6 selected workers will be from the same 
shift? 
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(c) What is the probability that at least two different shifts will be represented 
among the selected workers? 

(d) What is the probability that at least one of the shifts will be unrepresented 
in the sample of workers? 

41. An academic department with five faculty members narrowed its choice for
department head to either candidate A or candidate B. Each member then voted
on a slip of paper for one of the candidates. Suppose there are actually three
votes for A and two for B. If the slips are selected for tallying in random order,
what is the probability that A remains ahead of B throughout the vote count
(e.g., this event occurs if the selected ordering is AABAB, but not for ABBAA)?

42. An experimenter is studying the effects of temperature, pressure, and type of
catalyst on yield from a chemical reaction. Three different temperatures, four
different pressures, and five different catalysts are under consideration.

(a) If any particular experimental run involves the use of a single temperature,
pressure, and catalyst, how many experimental runs are possible?

(b) How many experimental runs involve use of the lowest temperature and
two lowest pressures?

(c) Suppose that five different experimental runs are to be made on the first day
of experimentation. If the five are randomly selected from among all the
possibilities, so that any group of five has the same probability of selection,
what is the probability that a different catalyst is used on each run?

43. A box in a certain supply room contains four 40-W lightbulbs, five 60-W bulbs,
and six 75-W bulbs. Suppose that three bulbs are randomly selected.

(a) What is the probability that exactly two of the selected bulbs are rated
75 W?

(b) What is the probability that all three of the selected bulbs have the same
rating?

(c) What is the probability that one bulb of each type is selected?

(d) Suppose now that bulbs are to be selected one by one until a 75-W bulb is
found. What is the probability that it is necessary to examine at least six
bulbs?

44. Fifteen telephones have just been received at an authorized service center. Five
of these telephones are cellular, five are cordless, and the other five are corded
phones. Suppose that these components are randomly allocated the numbers
1, 2, ..., 15 to establish the order in which they will be serviced.

(a) What is the probability that all the cordless phones are among the first ten to
be serviced?

(b) What is the probability that after servicing ten of these phones, phones of
only two of the three types remain to be serviced?

(c) What is the probability that two phones of each type are among the first six
serviced?
45. Three molecules of type A, three of type B, three of type C, and three of type
D are to be linked together to form a chain molecule. One such chain molecule
is ABCDABCDABCD, and another is BCDDAAABDBCC.

(a) How many such chain molecules are there? [Hint: If the three A’s were
distinguishable from one another (A1, A2, A3) and the B’s, C’s, and D’s
were also, how many molecules would there be? How is this number
reduced when the subscripts are removed from the A’s?]

(b) Suppose a chain molecule of the type described is randomly selected. What
is the probability that all three molecules of each type end up next to each
other (such as in BBBAAADDDCCC)?

46. A popular Dilbert cartoon strip (popular among statisticians, anyway) shows an
allegedly “random” number generator produce the sequence 999999 with the
accompanying comment, “That’s the problem with randomness: you can never
be sure.” Most people would agree that 999999 seems less “random” than, say,
703928, but in what sense is that true? Imagine we randomly generate a
six-digit number, i.e., we make six draws with replacement from the digits
0 through 9.

(a) What is the probability of generating 999999?

(b) What is the probability of generating 703928?

(c) What is the probability of generating a sequence of six identical digits?

(d) What is the probability of generating a sequence with no identical digits?
(Comparing the answers to (c) and (d) gives some sense of why some
sequences feel intuitively more random than others.)

(e) Here’s a real challenge: what is the probability of generating a sequence
with exactly one repeated digit?

47. Three married couples have purchased theater tickets and are seated in a row
consisting of just six seats. If they take their seats in a completely random
fashion (random order), what is the probability that Jim and Paula (husband and
wife) sit in the two seats on the far left? What is the probability that Jim and
Paula end up sitting next to one another? What is the probability that at least
one of the wives ends up sitting next to her husband?

48. A starting lineup in basketball consists of two guards, two forwards, and a
center.

(a) A certain college team has on its roster five guards, four forwards, and three
centers. How many different starting lineups can be created?

(b) Their opposing team in one particular game has three centers, four guards,
four forwards, and one individual (X) who can play either guard or forward.
How many different starting lineups can the opposing team create? [Hint:
Consider lineups without X, with X as a guard, and with X as a forward.]

(c) Now suppose a team has 4 guards, 4 forwards, 2 centers, and two players
(X and Y) who can play either guard or forward. If 5 of the 12 players on
this team are randomly selected, what is the probability that they constitute
a legitimate starting lineup?


49. Show that $\binom{n}{k} = \binom{n}{n-k}$. Give an interpretation involving subsets.
1.4 Conditional Probability


The probabilities assigned to various events depend on what is known about the 
experimental situation when the assignment is made. Subsequent to the initial 
assignment, partial information about or relevant to the outcome of the experiment 
may become available. Such information may cause us to revise some of our 
probability assignments. For a particular event A, we have used P(A) to represent 
the probability assigned to A; we now think of P(A) as the original or “uncondi- 
tional” probability of the event A. 

In this section, we examine how the information “an event B has occurred” 
affects the probability assigned to A. For example, A might refer to an individual 
having a particular disease in the presence of certain symptoms. If a blood test is 
performed on the individual and the result is negative (B = negative blood test), 
then the probability of having the disease will change (it should decrease, but not 
usually to zero, since blood tests are not infallible). 


Example 1.26 Complex components are assembled in a plant that uses two
different assembly lines, A and A’. Line A uses older equipment than A’, so it is
somewhat slower and less reliable. Suppose on a given day line A has assembled
8 components, of which 2 have been identified as defective (B) and 6 as
nondefective (B’), whereas A’ has produced 1 defective and 9 nondefective
components. This information is summarized in the accompanying table.


            Condition
Line        B       B’
A           2       6
A’          1       9


Unaware of this information, the sales manager randomly selects 1 of these
18 components for a demonstration. Prior to the demonstration,

\[
P(\text{line } A \text{ component selected}) = P(A) = \frac{N(A)}{N} = \frac{8}{18} = .444
\]

However, if the chosen component turns out to be defective, then the event B has
occurred, so the component must have been one of the 3 in the B column of the
table. Since these 3 components are equally likely among themselves, the probability
the component was selected from line A, given that event B has occurred, is

\[
P(A, \text{ given } B) = \frac{2}{3} = \frac{2/18}{3/18} = \frac{P(A \cap B)}{P(B)} \tag{1.2}
\]
In Eq. (1.2), the conditional probability is expressed as a ratio of unconditional
probabilities. The numerator is the probability of the intersection of the two events,
whereas the denominator is the probability of the conditioning event B. A Venn
diagram illuminates this relationship (Fig. 1.9).

Given that B has occurred, the relevant sample space is no longer S but consists
of just the outcomes in B, and A has occurred if and only if one of the outcomes in the
intersection A ∩ B occurred. So the conditional probability of A given B should,
logically, be the ratio of the likelihoods of these two events.


[Figure: Venn diagram of events A and B. After “conditioning” on event B, B becomes the new “sample space” and A ∩ B is what remains of event A.]

Fig. 1.9 Motivating the definition of conditional probability


1.4.1 The Definition of Conditional Probability 


Example 1.26 demonstrates that when outcomes are equally likely, computation of 
conditional probabilities can be based on intuition. When experiments are more 
complicated, though intuition may fail us, we want to have a general definition 
of conditional probability that will yield intuitive answers in simple problems. 
Figure 1.9 and Eq. (1.2) suggest the appropriate definition. 


DEFINITION
For any two events A and B with P(B) > 0, the conditional probability of A
given that B has occurred, denoted P(A | B), is defined by

\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{1.3}
\]


Example 1.27 Suppose that of all individuals buying a certain digital camera, 60%
include an optional memory card in their purchase, 40% include an extra battery,
and 30% include both a card and battery. Consider randomly selecting a buyer and
let A = {memory card purchased} and B = {battery purchased}. Then P(A) = .60,
P(B) = .40, and P(both purchased) = P(A ∩ B) = .30. Given that the selected individual
purchased an extra battery, the probability that an optional card was also
purchased is

\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{.30}{.40} = .75
\]

That is, of all those purchasing an extra battery, 75% purchased an optional
memory card. Similarly,

\[
P(\text{battery} \mid \text{memory card}) = P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{.30}{.60} = .50
\]

Notice that P(A | B) ≠ P(A) and P(B | A) ≠ P(B). Notice also that P(A | B) ≠ P(B | A):
these represent two different probabilities computed using different pieces of
“given” information. ■
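
The two calculations in Example 1.27 can be checked with a few lines of code. The following is a minimal Python sketch that is not part of the original text; the helper name `conditional` is hypothetical and introduced only for illustration. It applies the definition in Eq. (1.3) directly, using exact fractions to avoid rounding.

```python
from fractions import Fraction

def conditional(p_joint, p_given):
    """Return P(A | B) = P(A and B) / P(B); requires P(B) > 0."""
    if p_given <= 0:
        raise ValueError("conditioning event must have positive probability")
    return p_joint / p_given

# Probabilities from Example 1.27, stored exactly as fractions
p_A = Fraction(60, 100)        # memory card purchased
p_B = Fraction(40, 100)        # extra battery purchased
p_A_and_B = Fraction(30, 100)  # both purchased

p_A_given_B = conditional(p_A_and_B, p_B)  # card, given battery
p_B_given_A = conditional(p_A_and_B, p_A)  # battery, given card

print(float(p_A_given_B))  # 0.75
print(float(p_B_given_A))  # 0.5
```

The same joint probability appears in both numerators; only the conditioning event in the denominator changes, which is why the two answers differ.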


Example 1.28 A news magazine includes three columns entitled “Art” (A),
“Books” (B), and “Cinema” (C). Reading habits of a randomly selected reader
with respect to these columns are


Read regularly   A     B     C     A∩B   A∩C   B∩C   A∩B∩C
Probability      .14   .23   .37   .08   .09   .13   .05


(See Fig. 1.10.)
We thus have

\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{.08}{.23} = .348
\]

\[
P(A \mid B \cup C) = \frac{P(A \cap (B \cup C))}{P(B \cup C)} = \frac{.04 + .05 + .03}{.47} = \frac{.12}{.47} = .255
\]

\[
P(A \mid \text{reads at least one}) = P(A \mid A \cup B \cup C) = \frac{P(A \cap (A \cup B \cup C))}{P(A \cup B \cup C)} = \frac{P(A)}{P(A \cup B \cup C)} = \frac{.14}{.49} = .286
\]

and

\[
P(A \cup B \mid C) = \frac{P((A \cup B) \cap C)}{P(C)} = \frac{.04 + .05 + .08}{.37} = .459
\]

[Figure: Venn diagram of events A, B, and C with the probability of each region.]

Fig. 1.10 Venn diagram for Example 1.28 ■

1.4.2 The Multiplication Rule for P(A ∩ B)


The definition of conditional probability yields the following result, obtained by
multiplying both sides of Eq. (1.3) by P(B).


MULTIPLICATION RULE
P(A ∩ B) = P(A | B) · P(B)


This rule is important because it is often the case that P(A ∩ B) is desired,
whereas both P(B) and P(A | B) can be specified from the problem description.
By reversing the roles of A and B, the Multiplication Rule can also be written as
P(A ∩ B) = P(B | A) · P(A).


Example 1.29 Four individuals have responded to a request by a blood bank for
blood donations. None of them has donated before, so their blood types are
unknown. Suppose only type O+ is desired and only one of the four actually has
this type. If the potential donors are selected in random order for typing, what is the
probability that at least three individuals must be typed to obtain the desired type?

Define B = {first type not O+} and A = {second type not O+}. Since three of the
four potential donors are not O+, P(B) = 3/4. Given that the first person typed is not
O+, two of the three individuals left are not O+, and so P(A | B) = 2/3. The Multiplication
Rule now gives

\[
\begin{aligned}
P(\text{at least three individuals are typed}) &= P(\text{first two typed are not O+}) \\
&= P(A \cap B) \\
&= P(A \mid B) \cdot P(B) \\
&= \frac{2}{3} \cdot \frac{3}{4} = \frac{6}{12} = .5
\end{aligned}
\]
■


The Multiplication Rule is most useful when the experiment consists of several
stages in succession. The conditioning event B then describes the outcome of the
first stage and A the outcome of the second, so that P(A | B), which conditions on
what occurs first, will often be known. The rule is easily extended to experiments
involving more than two stages. For example,

\[
\begin{aligned}
P(A_1 \cap A_2 \cap A_3) &= P(A_3 \mid A_1 \cap A_2) \cdot P(A_1 \cap A_2) \\
&= P(A_3 \mid A_1 \cap A_2) \cdot P(A_2 \mid A_1) \cdot P(A_1)
\end{aligned}
\]

where A1 occurs first, followed by A2, and finally A3.

Example 1.30 For the blood typing experiment of Example 1.29,

\[
\begin{aligned}
P(\text{third type is O+}) &= P(\text{third is} \mid \text{first isn't} \cap \text{second isn't}) \cdot P(\text{second isn't} \mid \text{first isn't}) \cdot P(\text{first isn't}) \\
&= \frac{1}{2} \cdot \frac{2}{3} \cdot \frac{3}{4} = \frac{1}{4} = .25
\end{aligned}
\]
■
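
Because the blood typing experiment has only 4! = 24 equally likely typing orders, the Multiplication Rule answers in Examples 1.29 and 1.30 can be confirmed by brute-force enumeration. The following Python sketch is an illustration rather than part of the original text; the donor labels and the helper `prob` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

# Four donors; exactly one (labeled "O+") has the desired type.
donors = ["O+", "x1", "x2", "x3"]
orders = list(permutations(donors))  # 24 equally likely typing orders

def prob(event):
    """Fraction of equally likely orderings for which event(order) is true."""
    return Fraction(sum(1 for o in orders if event(o)), len(orders))

# Example 1.29: at least three are typed exactly when the first two are not O+
p_at_least_three = prob(lambda o: o[0] != "O+" and o[1] != "O+")

# Example 1.30: the third person typed is the O+ donor
p_third = prob(lambda o: o[2] == "O+")

print(p_at_least_three, p_third)  # 1/2 1/4
```

Enumeration is feasible here only because the sample space is tiny; the Multiplication Rule gives the same answers without listing any outcomes.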


When the experiment of interest consists of a sequence of several stages, it is 
convenient to represent these with a tree diagram. Once we have an appropriate tree 
diagram, probabilities and conditional probabilities can be entered on the various 
branches; this will make repeated use of the Multiplication Rule quite straightforward. 


Example 1.31 A chain of electronics stores sells three different brands of DVD
players. Of its DVD player sales, 50% are brand 1 (the least expensive), 30% are
brand 2, and 20% are brand 3. Each manufacturer offers a 1-year warranty on parts
and labor. It is known that 25% of brand 1’s DVD players require warranty repair
work, whereas the corresponding percentages for brands 2 and 3 are 20% and 10%,
respectively.

1. What is the probability that a randomly selected purchaser has bought a brand
1 DVD player that will need repair while under warranty?

2. What is the probability that a randomly selected purchaser has a DVD player that
will need repair while under warranty?

3. If a customer returns to the store with a DVD player that needs warranty repair
work, what is the probability that it is a brand 1 DVD player? A brand 2 DVD
player? A brand 3 DVD player?
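[Figure: tree diagram. Initial branches: Brand 1, Brand 2 (P(A2) = .30), and Brand 3; second-generation branches: repair (B) and no repair (B’). Products shown beside the repair branches: P(B | A1) · P(A1) = P(B ∩ A1) = .125, P(B | A2) · P(A2) = P(B ∩ A2) = .060, and P(B | A3) · P(A3) = P(B ∩ A3) = .020, giving P(B) = .205.]

Fig. 1.11 Tree diagram for Example 1.31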


The first stage of the problem involves a customer selecting one of the three
brands of DVD player. Let Ai = {brand i is purchased}, for i = 1, 2, and 3. Then
P(A1) = .50, P(A2) = .30, and P(A3) = .20. Once a brand of DVD player is selected,
the second stage involves observing whether the selected DVD player needs
warranty repair. With B = {needs repair} and B’ = {doesn’t need repair}, the
given information implies that P(B | A1) = .25, P(B | A2) = .20, and P(B | A3) = .10.

The tree diagram representing this experimental situation is shown in Fig. 1.11.
The initial branches correspond to different brands of DVD players; there are two
second-generation branches emanating from the tip of each initial branch, one for
“needs repair” and the other for “doesn’t need repair.” The probability P(Ai)
appears on the ith initial branch, whereas the conditional probabilities P(B | Ai) and
P(B’ | Ai) appear on the second-generation branches. To the right of each second-generation
branch corresponding to the occurrence of B, we display the product
of probabilities on the branches leading out to that point. This is simply
the Multiplication Rule in action. The answer to question 1 is thus P(A1 ∩ B) =
P(B | A1) · P(A1) = .125. The answer to question 2 is

\[
\begin{aligned}
P(B) &= P[(\text{brand 1 and repair}) \text{ or } (\text{brand 2 and repair}) \text{ or } (\text{brand 3 and repair})] \\
&= P(A_1 \cap B) + P(A_2 \cap B) + P(A_3 \cap B) \\
&= .125 + .060 + .020 = .205
\end{aligned}
\]

Finally,

\[
P(A_1 \mid B) = \frac{P(A_1 \cap B)}{P(B)} = \frac{.125}{.205} = .61
\]

\[
P(A_2 \mid B) = \frac{P(A_2 \cap B)}{P(B)} = \frac{.060}{.205} = .29
\]

and

\[
P(A_3 \mid B) = 1 - P(A_1 \mid B) - P(A_2 \mid B) = .10
\]

Notice that the initial or prior probability of brand 1 is .50, whereas once it is
known that the selected DVD player needed repair, the posterior probability of
brand 1 increases to .61. This is because brand 1 DVD players are more likely to
need warranty repair than are the other brands. The posterior probability of brand
3 is P(A3 | B) = .10, which is much less than the prior probability P(A3) = .20. ■
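
The tree-diagram arithmetic of Example 1.31 follows a fixed pattern: multiply along each branch, add the products, then divide each product by the total. A minimal Python sketch of that pattern follows; it is an illustration and not part of the original text, and the dictionary names are hypothetical.

```python
# Branch probabilities from Example 1.31, keyed by brand number
priors = {1: 0.50, 2: 0.30, 3: 0.20}        # P(A_i): share of sales
repair_rates = {1: 0.25, 2: 0.20, 3: 0.10}  # P(B | A_i): warranty repair rate

# Multiplication Rule along each branch: P(A_i and B) = P(B | A_i) * P(A_i)
joint = {i: repair_rates[i] * priors[i] for i in priors}

# Adding the branch products gives P(B), the answer to question 2
p_repair = sum(joint.values())

# Dividing each branch product by P(B) gives the posterior P(A_i | B)
posterior = {i: joint[i] / p_repair for i in joint}

print({i: round(p, 3) for i, p in joint.items()})      # {1: 0.125, 2: 0.06, 3: 0.02}
print(round(p_repair, 3))                              # 0.205
print({i: round(p, 2) for i, p in posterior.items()})  # {1: 0.61, 2: 0.29, 3: 0.1}
```

The posteriors necessarily sum to 1, which is why the text can obtain P(A3 | B) by subtraction.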


1.4.3 The Law of Total Probability and Bayes’ Theorem 


The computation of a posterior probability P(Aj | B) from given prior probabilities
P(Ai) and conditional probabilities P(B | Ai) occupies a central position in elementary
probability. The general rule for such computations, which is really just a simple
application of the Multiplication Rule, goes back to the Reverend Thomas Bayes,
who lived in the eighteenth century. To state it we first need another result. Recall
that events A1, ..., Ak are mutually exclusive if no two have any common outcomes.
The events are exhaustive if one Ai must occur, so that A1 ∪ ··· ∪ Ak = S.


LAW OF TOTAL PROBABILITY
Let A1, ..., Ak be mutually exclusive and exhaustive events. Then for any
other event B,

\[
P(B) = P(B \mid A_1) \cdot P(A_1) + \cdots + P(B \mid A_k) \cdot P(A_k) = \sum_{i=1}^{k} P(B \mid A_i) P(A_i) \tag{1.4}
\]
Proof Because the Ai’s are mutually exclusive and exhaustive, if B occurs it must
be in conjunction with exactly one of the Ai’s. That is, B = (A1 and B) or ... or (Ak
and B) = (A1 ∩ B) ∪ ··· ∪ (Ak ∩ B), where the events (Ai ∩ B) are mutually exclusive.
This “partitioning of B” is illustrated in Fig. 1.12. Thus

\[
P(B) = \sum_{i=1}^{k} P(A_i \cap B) = \sum_{i=1}^{k} P(B \mid A_i) P(A_i)
\]

as desired. ■


Fig. 1.12 Partition of B by mutually exclusive and exhaustive Ai’s


An example of the use of Eq. (1.4) appeared in answering question 2 of Example
1.31, where A1 = {brand 1}, A2 = {brand 2}, A3 = {brand 3}, and B = {repair}.


Example 1.32 A student has three different e-mail accounts. Most of her
messages, in fact 70%, come into account #1, whereas 20% come into account #2
and the remaining 10% into account #3. Of the messages coming into account #1,
only 1% are spam, compared to 2% and 5% for account #2 and account #3,
respectively. What is the student’s overall spam rate, i.e., what is the probability
a randomly selected e-mail message received by her is spam?

To answer this question, let’s first establish some notation:

Ai = {message is from account #i} for i = 1, 2, 3; B = {message is spam}

The given percentages imply that

\[
\begin{aligned}
P(A_1) &= .70, & P(A_2) &= .20, & P(A_3) &= .10 \\
P(B \mid A_1) &= .01, & P(B \mid A_2) &= .02, & P(B \mid A_3) &= .05
\end{aligned}
\]

Now it’s simply a matter of substituting into the equation for the Law of Total
Probability:

\[
P(B) = (.01)(.70) + (.02)(.20) + (.05)(.10) = .016
\]

In the long run, 1.6% of her messages will be spam. ■
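
The "long run" interpretation of the 1.6% spam rate can be illustrated by simulating the two-stage experiment: first pick an account according to P(Ai), then decide whether the message is spam according to P(B | Ai). The following Python sketch is an illustration, not part of the original text; the seed, the number of simulated messages, and all variable names are arbitrary assumptions.

```python
import random

rng = random.Random(2024)  # fixed seed (arbitrary) so the run is reproducible

accounts = [1, 2, 3]
account_probs = [0.70, 0.20, 0.10]        # P(A_i)
spam_rates = {1: 0.01, 2: 0.02, 3: 0.05}  # P(B | A_i)

n = 200_000  # number of simulated messages (arbitrary choice)
spam_count = 0
for _ in range(n):
    acct = rng.choices(accounts, weights=account_probs)[0]  # stage 1: account
    if rng.random() < spam_rates[acct]:                     # stage 2: spam or not
        spam_count += 1

estimate = spam_count / n

# Exact value from the Law of Total Probability, Eq. (1.4)
exact = sum(spam_rates[a] * p for a, p in zip(accounts, account_probs))

print(round(exact, 3))  # 0.016
print(estimate)         # should be close to 0.016; the digits depend on the seed
```

The simulated relative frequency only approximates P(B); increasing `n` tightens the approximation.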
BAYES’ THEOREM
Let A1, ..., Ak be a collection of mutually exclusive and exhaustive events
with P(Ai) > 0 for i = 1, ..., k. Then for any other event B for which P(B) > 0,

\[
P(A_j \mid B) = \frac{P(A_j \cap B)}{P(B)} = \frac{P(B \mid A_j) P(A_j)}{\sum_{i=1}^{k} P(B \mid A_i) P(A_i)}, \quad j = 1, \ldots, k \tag{1.5}
\]

The transition from the second to the third expression in Eq. (1.5) rests on using
the Multiplication Rule in the numerator and the Law of Total Probability in the
denominator.

The proliferation of events and subscripts in Eq. (1.5) can be a bit intimidating to
probability newcomers. When k = 2, so that the partition of S consists of just A1 = A
and A2 = A’, Bayes’ Theorem becomes

\[
P(A \mid B) = \frac{P(A) P(B \mid A)}{P(A) P(B \mid A) + P(A') P(B \mid A')}
\]


As long as there are relatively few events in the partition, a tree diagram (as in
Example 1.31) can be used as a basis for calculating posterior probabilities without
ever referring explicitly to Bayes’ Theorem.


Example 1.33 Incidence of a rare disease. In the book’s Introduction, we 
presented the following example as a common misunderstanding of probability in 
everyday life. Only 1 in 1000 adults is afflicted with a rare disease for which a 
diagnostic test has been developed. The test is such that when an individual actually 
has the disease, a positive result will occur 99% of the time, whereas an individual 
without the disease will show a positive test result only 2% of the time. If a 
randomly selected individual is tested and the result is positive, what is the 
probability that the individual has the disease? 

[Note: The sensitivity of this test is 99%, whereas the specificity—how specific 
positive results are to this disease—is 98%. As an indication of the accuracy of 
medical tests, an article in the October 29, 2010 New York Times reported that the 
sensitivity and specificity for a new DNA test for colon cancer were 86% and 93%, 
respectively. The PSA test for prostate cancer has sensitivity 85% and specificity 
about 30%, while the mammogram for breast cancer has sensitivity 75% and 
specificity 92%. All tests are less than perfect.] 

To use Bayes’ Theorem, let A1 = {individual has the disease}, A2 = {individual
does not have the disease}, and B = {positive test result}. Then P(A1) = .001,
P(A2) = .999, P(B | A1) = .99, and P(B | A2) = .02. The tree diagram for this problem
is in Fig. 1.13.
[Figure: tree diagram. The two branches ending in a positive test result are labeled P(A1 ∩ B) = .00099 and P(A2 ∩ B) = .01998.]

Fig. 1.13 Tree diagram for the rare-disease problem


Next to each branch corresponding to a positive test result, the Multiplication
Rule yields the recorded probabilities. Therefore, P(B) = .00099 + .01998 = .02097,
from which we have

\[
P(A_1 \mid B) = \frac{P(A_1 \cap B)}{P(B)} = \frac{.00099}{.02097} = .047
\]

This result seems counterintuitive; because the diagnostic test appears so accu- 
rate, we expect someone with a positive test result to be highly likely to have the 
disease, whereas the computed conditional probability is only .047. However, 
because the disease is rare and the test only moderately reliable, most positive 
test results arise from errors rather than from diseased individuals. The probability 
of having the disease has increased by a multiplicative factor of 47 (from prior .001 
to posterior .047); but to get a further increase in the posterior probability, a 
diagnostic test with much smaller error rates is needed. If the disease were not so 
rare (e.g., 25% incidence in the population), then the error rates for the present test 
would provide good diagnoses. 

This example shows why it makes sense to be tested for a rare disease only if you 
are in a high-risk group. For example, most of us are at low risk for HIV infection, 
so testing would not be indicated, but those who are in a high-risk group should be 
tested for HIV. For some diseases the degree of risk is strongly influenced by age. 
Young women are at low risk for breast cancer and should not be tested, but older 
women do have increased risk and need to be tested. There is some argument about 
where to draw the line. If we can find the incidence rate for our group and the 
sensitivity and specificity for the test, then we can do our own calculation to see if a 
positive test result would be informative. ■
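
The "do our own calculation" suggestion at the end of Example 1.33 amounts to evaluating the two-event form of Bayes’ Theorem for a given incidence rate, sensitivity, and specificity. The following Python sketch is an illustration, not part of the original text; the function name `posterior_positive` and its parameter names are hypothetical.

```python
def posterior_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) from Bayes' Theorem with the partition
    {has disease, does not have disease}."""
    true_pos = sensitivity * prevalence                # P(A1 and B)
    false_pos = (1 - specificity) * (1 - prevalence)   # P(A2 and B)
    return true_pos / (true_pos + false_pos)

# Rare disease from Example 1.33: incidence 1 in 1000
rare = posterior_positive(0.001, 0.99, 0.98)

# Same test, but a 25% incidence as mentioned in the text
common = posterior_positive(0.25, 0.99, 0.98)

print(round(rare, 3))    # 0.047
print(round(common, 3))  # 0.943
```

With the test held fixed, the posterior probability rises from about .047 to about .943 as the incidence goes from .001 to .25, which matches the text’s remark that the same error rates would give good diagnoses for a disease that is not rare.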
1.4.4 Exercises: Section 1.4 (50-78)


50. The population of a particular country consists of three ethnic groups. Each
individual belongs to one of the four major blood groups. The accompanying
joint probability table gives the proportions of individuals in the various
ethnic group-blood group combinations.


                    Blood group
Ethnic group    O       A       B       AB
1               .082    .106    .008    .004
2               .135    .141    .018    .006
3               .215    .200    .065    .020


Suppose that an individual is randomly selected from the population, and
define events by A = {type A selected}, B = {type B selected}, and C =
{ethnic group 3 selected}.

(a) Calculate P(A), P(C), and P(A ∩ C).

(b) Calculate both P(A | C) and P(C | A) and explain in context what each of
these probabilities represents.

(c) If the selected individual does not have type B blood, what is the probability
that he or she is from ethnic group 1?

51. Suppose an individual is randomly selected from the population of all adult
males living in the USA. Let A be the event that the selected individual is over
6 ft in height, and let B be the event that the selected individual is a
professional basketball player. Which do you think is larger, P(A | B) or
P(B | A)? Why?

52. Return to the credit card scenario of Exercise 14, where A = {Visa}, B =
{MasterCard}, P(A) = .5, P(B) = .4, and P(A ∩ B) = .25. Calculate and interpret
each of the following probabilities (a Venn diagram might help).

(a) P(B | A)

(b) P(B’ | A)

(c) P(A | B)

(d) P(A’ | B)

(e) Given that the selected individual has at least one card, what is the
probability that he or she has a Visa card?

53. Reconsider the system defect situation described in Exercise 28.

(a) Given that the system has a type 1 defect, what is the probability that it has
a type 2 defect?

(b) Given that the system has a type 1 defect, what is the probability that it has
all three types of defects?

(c) Given that the system has at least one type of defect, what is the probability
that it has exactly one type of defect?

(d) Given that the system has both of the first two types of defects, what is the
probability that it does not have the third type of defect?
54. The accompanying table gives information on the type of coffee selected by
someone purchasing a single cup at a particular airport kiosk.


            Small   Medium  Large
Regular     14%     20%     26%
Decaf       20%     10%     10%


Consider randomly selecting such a coffee purchaser.

(a) What is the probability that the individual purchased a small cup? A cup
of decaf coffee?

(b) If we learn that the selected individual purchased a small cup, what now is
the probability that s/he chose decaf coffee, and how would you interpret
this probability?

(c) If we learn that the selected individual purchased decaf, what now is the
probability that a small size was selected, and how does this compare to
the corresponding unconditional probability from (a)?

55. A department store sells sport shirts in three sizes (small, medium, and large),
three patterns (plaid, print, and stripe), and two sleeve lengths (long and
short). The accompanying tables give the proportions of shirts sold in the
various category combinations.

Short-sleeved

            Pattern
Size    Plaid   Print   Stripe
S       .04     .02     .05
M       .08     .07     .12
L       .03     .07     .08

Long-sleeved

            Pattern
Size    Plaid   Print   Stripe
S       .03     .02     .03
M       .10     .05     .07
L       .04     .02     .08


(a) What is the probability that the next shirt sold is a medium, long-sleeved,
print shirt?

(b) What is the probability that the next shirt sold is a medium print shirt?

(c) What is the probability that the next shirt sold is a short-sleeved shirt? A
long-sleeved shirt?

(d) What is the probability that the size of the next shirt sold is medium? That
the pattern of the next shirt sold is a print?

(e) Given that the shirt just sold was a short-sleeved plaid, what is the
probability that its size was medium?

(f) Given that the shirt just sold was a medium plaid, what is the probability
that it was short-sleeved? Long-sleeved?

56. One box contains six red balls and four green balls, and a second box contains
seven red balls and three green balls. A ball is randomly chosen from the first
box and placed in the second box. Then a ball is randomly selected from the
second box and placed in the first box.

(a) What is the probability that a red ball is selected from the first box and a
red ball is selected from the second box?

(b) At the conclusion of the selection process, what is the probability that the
numbers of red and green balls in the first box are identical to the numbers
at the beginning?

57. A system consists of two identical pumps, #1 and #2. If one pump fails, the
system will still operate. However, because of the added strain, the
remaining pump is now more likely to fail than was originally the case. That
is, r = P(#2 fails | #1 fails) > P(#2 fails) = q. If at least one pump fails by the
end of the pump design life in 7% of all systems and both pumps fail during
that period in only 1%, what is the probability that pump #1 will fail during the
pump design life?

58. A certain shop repairs both audio and video components. Let A denote the
event that the next component brought in for repair is an audio component,
and let B be the event that the next component is a compact disc player (so the
event B is contained in A). Suppose that P(A) = .6 and P(B) = .05. What is
P(B | A)?

59. In Exercise 15, Ai = {awarded project i}, for i = 1, 2, 3. Use the probabilities
given there to compute the following probabilities, and explain in words the
meaning of each one.

(a) P(A2 | A1)

(b) P(A2 ∩ A3 | A1)

(c) P(A2 ∪ A3 | A1)

(d) P(A1 ∩ A2 ∩ A3 | A1 ∪ A2 ∪ A3)

60. Three plants manufacture hard drives and ship them to a warehouse for
distribution. Plant I produces 54% of the warehouse’s inventory with a 4%
defect rate. Plant II produces 35% of the warehouse’s inventory with an 8%
defect rate. Plant III produces the remainder of the warehouse’s inventory with
a 12% defect rate.

(a) Draw a tree diagram to represent this information.

(b) A warehouse inspector selects one hard drive at random. What is the
probability that it is a defective hard drive and from Plant II?

(c) What is the probability that a randomly selected hard drive is defective?

(d) Suppose a hard drive is defective. What is the probability that it came from
Plant I?
61. For any events A and B with P(B) > 0, show that P(A | B) + P(A’ | B) = 1.

62. If P(B | A) > P(B), show that P(B’ | A) < P(B’). [Hint: Add P(B’ | A) to both sides of
the given inequality and then use the result of the previous exercise.]

63. Show that for any three events A, B, and C with P(C) > 0, P(A ∪ B | C) = P(A | C)
+ P(B | C) − P(A ∩ B | C).

64. At a certain gas station, 40% of the customers use regular gas (A1), 35% use
mid-grade gas (A2), and 25% use premium gas (A3). Of those customers using
regular gas, only 30% fill their tanks (event B). Of those customers using
mid-grade gas, 60% fill their tanks, whereas of those using premium, 50% fill
their tanks.

(a) What is the probability that the next customer will request mid-grade gas
and fill the tank (A2 ∩ B)?

(b) What is the probability that the next customer fills the tank?

(c) If the next customer fills the tank, what is the probability that regular gas is
requested? Mid-grade gas? Premium gas?

65. Suppose a single gene controls the color of hamsters: black (B) is dominant and 
brown (b) is recessive. Hence, a hamster will be black unless its genotype is bb. 
Two hamsters, each with genotype Bb, mate and produce a single offspring. 
The laws of genetic recombination state that each parent is equally likely to 
donate either of its two alleles (B or b), so the offspring is equally likely to be
any of BB, Bb, bB, or bb (the middle two are genetically equivalent). 

(a) What is the probability their offspring has black fur? 
(b) Given that their offspring has black fur, what is the probability its genotype 
is Bb? 

66. Refer back to the scenario of the previous exercise. In the figure below, the
genotypes of both members of Generation I are known, as is the genotype of the
male member of Generation II. We know that hamster II2 must be black-colored
thanks to her father, but suppose that we don’t know her genotype exactly (as
indicated by B- in the figure).

[Figure: pedigree chart showing hamsters 1 and 2 in Generation I, hamsters
1 (genotype Bb) and 2 (genotype B-) in Generation II, and hamster 1 in
Generation III]

(a) What are the possible genotypes of hamster II2, and what are the
corresponding probabilities?

(b) If we observe that hamster III1 has a black coat (and hence at least one
B gene), what is the probability her genotype is Bb?

(c) If we later discover (through DNA testing on poor little hamster III1) that
her genotype is BB, what is the posterior probability that her mom is also
BB?

67. Seventy percent of the light aircraft that disappear while in flight in a certain
country are subsequently discovered. Of the aircraft that are discovered, 60%
have an emergency locator, whereas 90% of the aircraft not discovered do not
have such a locator. Suppose a light aircraft has disappeared.

(a) If it has an emergency locator, what is the probability that it will not be
discovered?
(b) If it does not have an emergency locator, what is the probability that it will
be discovered?

68. Components of a certain type are shipped to a supplier in batches of ten.
Suppose that 50% of all such batches contain no defective components, 30%
contain one defective component, and 20% contain two defective components.
Two components from a batch are randomly selected and tested. What are the
probabilities associated with 0, 1, and 2 defective components being in the
batch under each of the following conditions?

(a) Neither tested component is defective.
(b) One of the two tested components is defective.
[Hint: Draw a tree diagram with three first-generation branches for the three
different types of batches.]

69. Show that $P(A \cap B \mid C) = P(A \mid B \cap C) \cdot P(B \mid C)$.

70. For customers purchasing a full set of tires at a particular tire store, consider the
events

A = {tires purchased were made in the USA}
B = {purchaser has tires balanced immediately}
C = {purchaser requests front-end alignment}

along with $A'$, $B'$, and $C'$. Assume the following unconditional and conditional
probabilities:

$$P(A) = .75 \qquad P(B \mid A) = .9 \qquad P(B \mid A') = .8 \qquad P(C \mid A \cap B) = .8$$
$$P(C \mid A \cap B') = .6 \qquad P(C \mid A' \cap B) = .7 \qquad P(C \mid A' \cap B') = .3$$

(a) Construct a tree diagram consisting of first-, second-, and third-generation
branches and place an event label and appropriate probability next to each
branch.
(b) Compute $P(A \cap B \cap C)$.
(c) Compute $P(B \cap C)$.
(d) Compute $P(C)$.
(e) Compute $P(A \mid B \cap C)$, the probability of a purchase of US tires given that
both balancing and an alignment were requested.

71. A professional organization (for statisticians, of course) sells term life
insurance and major medical insurance. Of those who have just life insurance, 70%
will renew next year, and 80% of those with only a major medical policy will
renew next year. However, 90% of policyholders who have both types of policy
will renew at least one of them next year. Of the policyholders, 75% have term
life insurance, 45% have major medical, and 20% have both.
(a) Calculate the percentage of policyholders that will renew at least one policy
next year.
(b) If a randomly selected policyholder does in fact renew next year, what is
the probability that he or she has both life and major medical insurance?

72. The Reviews editor for a certain scientific journal decides whether the review
for any particular book should be short (1-2 pages), medium (3-4 pages), or
long (5-6 pages). Data on recent reviews indicate that 60% of them are short,
30% are medium, and the other 10% are long. Reviews are submitted in either
Word or LaTeX. For short reviews, 80% are in Word, whereas 50% of medium
reviews and 30% of long reviews are in Word. Suppose a recent review is
randomly selected.
(a) What is the probability that the selected review was submitted in Word?
(b) If the selected review was submitted in Word, what are the posterior
probabilities of it being short, medium, and long?

73. A large operator of timeshare complexes requires anyone interested in making
a purchase to first visit the site of interest. Historical data indicates that 20% of
all potential purchasers select a day visit, 50% choose a one-night visit, and
30% opt for a two-night visit. In addition, 10% of day visitors ultimately make
a purchase, 30% of one-night visitors buy a unit, and 20% of those visiting for two
nights decide to buy. Suppose a visitor is randomly selected and found to have
bought a timeshare. How likely is it that this person made a day visit? A
one-night visit? A two-night visit?

74. Consider the following information about travelers (based partly on a recent
Travelocity poll): 40% check work e-mail, 30% use a cell phone to stay
connected to work, 25% bring a laptop with them, 23% both check work
e-mail and use a cell phone to stay connected, and 51% neither check work
e-mail nor use a cell phone to stay connected nor bring a laptop. Finally, 88 out
of every 100 who bring a laptop check work e-mail, and 70 out of every
100 who use a cell phone to stay connected also bring a laptop.
(a) What is the probability that a randomly selected traveler who checks work
e-mail also uses a cell phone to stay connected?
(b) What is the probability that someone who brings a laptop on vacation also
uses a cell phone to stay connected?
(c) If a randomly selected traveler checked work e-mail and brought a laptop,
what is the probability that s/he uses a cell phone to stay connected?

75. There has been a great deal of controversy over the last several years regarding
what types of surveillance are appropriate to prevent terrorism. Suppose a
particular surveillance system has a 99% chance of correctly identifying a
future terrorist and a 99.9% chance of correctly identifying someone who is
not a future terrorist. Imagine there are 1000 future terrorists in a population of
300 million (roughly the US population). If one of these 300 million people is
randomly selected and the system determines him/her to be a future terrorist,
what is the probability the system is correct? Does your answer make you
uneasy about using the surveillance system? Explain.

76. At a large university, in the never-ending quest for a satisfactory textbook, the
Statistics Department has tried a different text during each of the last three 
quarters. During the fall quarter, 500 students used the text by Professor Mean; 
during the winter quarter, 300 students used the text by Professor Median; and 
during the spring quarter, 200 students used the text by Professor Mode. A 
survey at the end of each quarter showed that 200 students were satisfied with 
Mean’s book, 150 were satisfied with Median’s book, and 160 were satisfied 
with Mode’s book. If a student who took statistics during one of these quarters 
is selected at random and admits to having been satisfied with the text, is the 
student most likely to have used the book by Mean, Median, or Mode? Who is 
the least likely author? [Hint: Draw a tree-diagram or use Bayes’ theorem.] 

77. A friend who lives in Los Angeles makes frequent consulting trips to
Washington, D.C.; 50% of the time she travels on airline #1, 30% of the time 
on airline #2, and the remaining 20% of the time on airline #3. For airline #1, 
flights are late into D.C. 30% of the time and late into L.A. 10% of the time. For 
airline #2, these percentages are 25% and 20%, whereas for airline #3 the 
percentages are 40% and 25%. If we learn that on a particular trip she arrived 
late at exactly one of the two destinations, what are the posterior probabilities 
of having flown on airlines #1, #2, and #3? Assume that the chance of a late 
arrival in L.A. is unaffected by what happens on the flight to D.C. [Hint: From 
the tip of each first-generation branch on a tree diagram, draw three second- 
generation branches labeled, respectively, 0 late, 1 late, and 2 late.] 

78. In Exercise 64, consider the following additional information on credit card
usage:

70% of all regular fill-up customers use a credit card.
50% of all regular non-fill-up customers use a credit card.
60% of all mid-grade fill-up customers use a credit card.
50% of all mid-grade non-fill-up customers use a credit card.
50% of all premium fill-up customers use a credit card.
40% of all premium non-fill-up customers use a credit card.

Compute the probability of each of the following events for the next customer
to arrive (a tree diagram might help).
(a) {mid-grade and fill-up and credit card}
(b) {premium and non-fill-up and credit card}
(c) {premium and credit card}
(d) {fill-up and credit card}
(e) {credit card}
(f) If the next customer uses a credit card, what is the probability that s/he
purchased premium gasoline?


1.5 Independence


The definition of conditional probability enables us to revise the probability $P(A)$
originally assigned to A when we are subsequently informed that another event
B has occurred; the new probability of A is $P(A \mid B)$. In our examples, it was
frequently the case that $P(A \mid B)$ was unequal to the unconditional probability $P(A)$,
indicating that the information “B has occurred” resulted in a change in the chance
of A occurring. There are other situations, however, in which the chance that
A will occur or has occurred is not affected by knowledge that B has occurred, so
that $P(A \mid B) = P(A)$. It is then natural to think of A and B as independent events,
meaning that the occurrence or nonoccurrence of one event has no bearing on the
chance that the other will occur.


DEFINITION
Two events A and B are independent if $P(A \mid B) = P(A)$ and are dependent
otherwise.


The definition of independence might seem “unsymmetrical” because we do not
demand that $P(B \mid A) = P(B)$ also. However, using the definition of conditional
probability and the Multiplication Rule,

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A \mid B)P(B)}{P(A)} \tag{1.6}$$

The right-hand side of Eq. (1.6) is $P(B)$ if and only if $P(A \mid B) = P(A)$
(independence), so the equality in the definition implies the other equality (and
vice versa). It is also straightforward to show that if A and B are independent,
then so are the following pairs of events: (1) $A'$ and $B$, (2) $A$ and $B'$, and
(3) $A'$ and $B'$. See Exercise 82.


Example 1.34 Consider an ordinary deck of 52 cards comprising the four suits
spades, hearts, diamonds, and clubs, with each suit consisting of the 13 ranks ace,
king, queen, jack, ten, ..., and two. Suppose someone randomly selects a card from
the deck and reveals to you that it is a picture card (that is, a king, queen, or jack).
What now is the probability that the card is a spade? If we let A = {spade} and B =
{face card}, then $P(A) = 13/52$, $P(B) = 12/52$ (there are three face cards in each of
the four suits), and $P(A \cap B)$ = P(spade and face card) = 3/52. Thus

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{3/52}{12/52} = \frac{3}{12} = \frac{1}{4} = \frac{13}{52}$$

Therefore, the likelihood of getting a spade is not affected by knowledge that a
face card had been selected. Intuitively this is because the fraction of spades among
face cards (3 out of 12) is the same as the fraction of spades in the entire deck
(13 out of 52). It is also easily verified that $P(B \mid A) = P(B)$, so knowledge that a
spade has been selected does not affect the likelihood of the card being a jack,
queen, or king. ■
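The check in Example 1.34 can also be done by brute-force enumeration. The following Python sketch is only an illustration: the names `deck` and `prob` and the rank and suit labels are our own choices rather than notation from the text, and it assumes that each of the 52 cards is equally likely. Exact fractions are used so that the equalities can be tested without rounding.

```python
from fractions import Fraction
from itertools import product

# Build the 52-card deck as (rank, suit) pairs; the labels are illustrative.
ranks = ["A", "K", "Q", "J", "10", "9", "8", "7", "6", "5", "4", "3", "2"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))  # 52 equally likely outcomes


def prob(event):
    """Probability of an event (a set of cards) under equally likely outcomes."""
    return Fraction(len(event), len(deck))


A = {card for card in deck if card[1] == "spades"}         # spade
B = {card for card in deck if card[0] in ("K", "Q", "J")}  # face card

p_A, p_B, p_AB = prob(A), prob(B), prob(A & B)
p_A_given_B = p_AB / p_B  # definition of conditional probability
p_B_given_A = p_AB / p_A

print("P(A) =", p_A, " P(A|B) =", p_A_given_B)
print("P(B) =", p_B, " P(B|A) =", p_B_given_A)
print("P(A and B) == P(A)P(B):", p_AB == p_A * p_B)
```

Both conditional probabilities match their unconditional counterparts, which is the symmetry described after Eq. (1.6).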


Example 1.35 Consider a gas station with six pumps numbered 1, 2, ..., 6 and let
$E_i$ denote the simple event that a randomly selected customer uses pump $i$. Suppose
that

$$P(E_1) = P(E_6) = .10, \quad P(E_2) = P(E_5) = .15, \quad P(E_3) = P(E_4) = .25$$

Define events A, B, C by

$$A = \{2, 4, 6\}, \quad B = \{1, 2, 3\}, \quad C = \{2, 3, 4, 5\}$$

It is easy to determine that $P(A) = .50$, $P(A \mid B) = .30$, and $P(A \mid C) = .50$.
Therefore, events A and B are dependent, whereas events A and C are independent.
Intuitively, A and C are independent because the relative division of probability
among even- and odd-numbered pumps is the same among pumps 2, 3, 4, 5 as it is
among all six pumps. ■
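The three probabilities quoted in Example 1.35 follow directly from the definition of conditional probability. A minimal Python sketch of that arithmetic is given below; the helper names `prob` and `cond_prob` are hypothetical, introduced here only for illustration, and the pump probabilities are the ones stated in the example.

```python
from fractions import Fraction

# Pump-usage probabilities from Example 1.35 (pump number -> probability).
pump_prob = {
    1: Fraction(10, 100), 2: Fraction(15, 100), 3: Fraction(25, 100),
    4: Fraction(25, 100), 5: Fraction(15, 100), 6: Fraction(10, 100),
}


def prob(event):
    """P(event): add the probabilities of the simple events it contains."""
    return sum((pump_prob[i] for i in event), Fraction(0))


def cond_prob(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return prob(event & given) / prob(given)


A = {2, 4, 6}
B = {1, 2, 3}
C = {2, 3, 4, 5}

for name, given in (("B", B), ("C", C)):
    p_cond = cond_prob(A, given)
    verdict = "independent" if p_cond == prob(A) else "dependent"
    print(f"P(A) = {float(prob(A)):.2f}, P(A|{name}) = {float(p_cond):.2f}: "
          f"A and {name} are {verdict}")
```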


Example 1.36 Let A and B be any two mutually exclusive events with $P(A) > 0$.
For example, for a randomly chosen automobile, let A = {car is blue} and B = {car
is red}. Since the events are mutually exclusive, if B occurs, then A cannot possibly
have occurred, so $P(A \mid B) = 0 \neq P(A)$. The message here is that if two events are
mutually exclusive, they cannot be independent. When A and B are mutually
exclusive, the information that A occurred says something about the chance of
B (namely, it cannot have occurred), so independence is precluded. ■


1.5.1 $P(A \cap B)$ When Events Are Independent


Frequently the nature of an experiment suggests that two events A and B should be
assumed independent. This is the case, for example, if a manufacturer receives a
circuit board from each of two different suppliers, each board is tested on arrival,
and A = {first is defective} and B = {second is defective}. If $P(A) = .1$, it should
also be the case that $P(A \mid B) = .1$; knowing the condition of the second board
shouldn’t provide information about the condition of the first. Our next result
shows how to compute $P(A \cap B)$ when the events are independent.


PROPOSITION
A and B are independent if and only if

$$P(A \cap B) = P(A) \cdot P(B) \tag{1.7}$$

Proof By the Multiplication Rule, $P(A \cap B) = P(A \mid B) \cdot P(B)$, and this equals
$P(A) \cdot P(B)$ if and only if $P(A \mid B) = P(A)$. ■


Because of the equivalence of independence with Eq. (1.7), the latter can be used
as a definition of independence.¹


Example 1.37 It is known that 30% of a certain company’s washing machines
require service while under warranty, whereas only 10% of its dryers need such
service. If someone purchases both a washer and a dryer made by this company,
what is the probability that both machines need warranty service?

Let A denote the event that the washer needs service while under warranty, and
let B be defined analogously for the dryer. Then $P(A) = .30$ and $P(B) = .10$.
Assuming that the two machines function independently of each other, the desired
probability is

$$P(A \cap B) = P(A) \cdot P(B) = (.30)(.10) = .03$$

The probability that neither machine needs service is

$$P(A' \cap B') = P(A') \cdot P(B') = (.70)(.90) = .63$$

Note that, although the independence assumption is reasonable here, it can be
questioned. In particular, if heavy usage causes a breakdown in one machine, it
could also cause trouble for the other one. ■


Example 1.38 Each day, Monday through Friday, a batch of components sent by a
first supplier arrives at a certain inspection facility. Two days a week, a batch also
arrives from a second supplier. Eighty percent of all supplier 1’s batches pass
inspection, and 90% of supplier 2’s do likewise. What is the probability that, on a
randomly selected day, two batches pass inspection? We will answer this assuming
that on days when two batches are tested, whether the first batch passes is
independent of whether the second batch does so. Figure 1.14 displays the relevant
information.


[Tree diagram: the path for "two batches received, both pass" carries probability .4 × (.8 × .9)]

Fig. 1.14 Tree diagram for Example 1.38


$$\begin{aligned}
P(\text{two pass}) &= P(\text{two received} \cap \text{both pass}) \\
&= P(\text{both pass} \mid \text{two received}) \cdot P(\text{two received}) \\
&= [(.8)(.9)](.4) = .288
\end{aligned}$$
■


¹ However, the multiplication property is satisfied if $P(B) = 0$, yet $P(A \mid B)$ is not defined in this case.
To make the multiplication property completely equivalent to the definition of independence, we
should append to that definition that A and B are also independent if either $P(A) = 0$ or $P(B) = 0$.


1.5.2 Independence of More than Two Events 


The notion of independence can be extended to collections of more than two events. 
Although it is possible to extend the definition for two independent events by 
working in terms of conditional and unconditional probabilities, it is more direct 
and less cumbersome to proceed along the lines of the last proposition. 


DEFINITION
Events $A_1, \ldots, A_n$ are mutually independent if for every $k$ ($k = 2, 3, \ldots, n$)
and every subset of indices $i_1, i_2, \ldots, i_k$,

$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) \cdot P(A_{i_2}) \cdots P(A_{i_k})$$
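The multiplication condition in this definition can be checked mechanically on a small finite sample space. The Python sketch below is an illustration under stated assumptions, not part of the text: it assumes equally likely outcomes, the function name `mutually_independent` is hypothetical, and the coin-toss events are our own example. It shows why the condition must hold for every subset of events and not merely for pairs.

```python
from fractions import Fraction
from itertools import combinations, product


def prob(event, sample_space):
    """P(event) when all outcomes in sample_space are equally likely."""
    return Fraction(len(event), len(sample_space))


def mutually_independent(events, sample_space):
    """Check the multiplication property for every subset of k >= 2 events."""
    for k in range(2, len(events) + 1):
        for subset in combinations(events, k):
            intersection = set(sample_space)
            product_of_probs = Fraction(1)
            for event in subset:
                intersection &= event
                product_of_probs *= prob(event, sample_space)
            if prob(intersection, sample_space) != product_of_probs:
                return False
    return True


# Illustrative example (not from the text): two tosses of a fair coin.
space = set(product("HT", repeat=2))
A = {s for s in space if s[0] == "H"}   # first toss is heads
B = {s for s in space if s[1] == "H"}   # second toss is heads
C = {s for s in space if s[0] == s[1]}  # the two tosses match

pairwise = all(mutually_independent([E, F], space)
               for E, F in combinations([A, B, C], 2))
mutual = mutually_independent([A, B, C], space)
print("pairwise independent:", pairwise)
print("mutually independent:", mutual)
```

In this coin example every pair satisfies the multiplication property, yet the three-way intersection has probability 1/4 rather than 1/8, so the events are not mutually independent.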


To paraphrase the definition, the events are mutually independent if the
probability of the intersection of any subset of the $n$ events is equal to the product of the
individual probabilities. In using this multiplication property for more than two
independent events, it is legitimate to replace one or more of the $A_i$s by their
complements (e.g., if $A_1$, $A_2$, and $A_3$ are independent events, then so are $A_1'$, $A_2'$,
and $A_3'$). As was the case with two events, we frequently specify at the outset of a
problem the independence of certain events. The definition can then be used to
calculate the probability of an intersection.

Example 1.39 The article “Reliability Evaluation of Solar Photovoltaic Arrays”
(Solar Energy, 2002: 129-141) presents various configurations of solar photovoltaic
arrays consisting of crystalline silicon solar cells. Consider first the system
illustrated in Fig. 1.15a. There are two subsystems connected in parallel, each one
containing three cells. In order for the system to function, at least one of the two
parallel subsystems must work. Within each subsystem, the three cells are
connected in series, so a subsystem will work only if all cells in the subsystem
work. Consider a particular lifetime value $t_0$, and suppose we want to determine the
probability that the system lifetime exceeds $t_0$. Let $A_i$ denote the event that the
lifetime of cell $i$ exceeds $t_0$ ($i = 1, 2, \ldots, 6$). We assume that the $A_i$s are independent
events (whether any particular cell lasts more than $t_0$ hours has no bearing on
whether any other cell does) and that $P(A_i) = .9$ for every $i$ since the cells are
identical. Then applying the Addition Rule followed by independence,


$$\begin{aligned}
P(\text{system lifetime exceeds } t_0) &= P[(A_1 \cap A_2 \cap A_3) \cup (A_4 \cap A_5 \cap A_6)] \\
&= P(A_1 \cap A_2 \cap A_3) + P(A_4 \cap A_5 \cap A_6) \\
&\quad - P[(A_1 \cap A_2 \cap A_3) \cap (A_4 \cap A_5 \cap A_6)] \\
&= (.9)(.9)(.9) + (.9)(.9)(.9) - (.9)(.9)(.9)(.9)(.9)(.9) \\
&= .927
\end{aligned}$$

Alternatively,

$$\begin{aligned}
P(\text{system lifetime exceeds } t_0) &= 1 - P(\text{both subsystem lives are} \le t_0) \\
&= 1 - [P(\text{subsystem life is} \le t_0)]^2 \\
&= 1 - [1 - P(\text{subsystem life is} > t_0)]^2 \\
&= 1 - [1 - (.9)^3]^2 = .927
\end{aligned}$$

Next consider the total-cross-tied system shown in Fig. 1.15b, obtained from the
series-parallel array by connecting ties across each column of junctions. Now the
system fails as soon as an entire column fails, and system lifetime exceeds $t_0$ only if
the life of every column does so. For this configuration,

$$\begin{aligned}
P(\text{system lifetime exceeds } t_0) &= [P(\text{column lifetime exceeds } t_0)]^3 \\
&= [1 - P(\text{column lifetime is} \le t_0)]^3 \\
&= [1 - P(\text{both cells in a column have lifetime} \le t_0)]^3 \\
&= [1 - (1 - .9)^2]^3 = .970
\end{aligned}$$


Fig. 1.15 System configurations for Example 1.39: (a) series-parallel; (b) total-cross-tied
■


Probabilities like those calculated in Example 1.39 are often referred to as the
reliability of a system. In Sect. 4.8, we consider in more detail the analysis of
system reliability.
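The two reliabilities in Example 1.39 can be double-checked numerically. The Python sketch below evaluates the closed-form expressions and, as an independent check, sums the probabilities of all 2^6 working/failed configurations of the six cells. It is an illustration only: the function names are hypothetical, and the pairing of cells into columns (1 with 4, 2 with 5, 3 with 6) is an assumption about the layout of Fig. 1.15b, which does not affect the answer because the cells are identical.

```python
from itertools import product

p = 0.9  # P(A_i): probability that a single cell's lifetime exceeds t_0

# Closed-form reliabilities from the example.
series_parallel_formula = 1 - (1 - p**3) ** 2
cross_tied_formula = (1 - (1 - p) ** 2) ** 3


def state_probability(state):
    """Probability of one working/failed pattern, assuming independent cells."""
    prob = 1.0
    for works in state:
        prob *= p if works else 1 - p
    return prob


def series_parallel_works(state):
    # Cells 1-3 form one series subsystem and cells 4-6 the other.
    return all(state[0:3]) or all(state[3:6])


def cross_tied_works(state):
    # Assumed columns: (1, 4), (2, 5), (3, 6); each needs at least one working cell.
    return all(state[j] or state[j + 3] for j in range(3))


states = list(product([True, False], repeat=6))
series_parallel_enum = sum(state_probability(s) for s in states
                           if series_parallel_works(s))
cross_tied_enum = sum(state_probability(s) for s in states
                      if cross_tied_works(s))

print(round(series_parallel_formula, 3), round(series_parallel_enum, 3))
print(round(cross_tied_formula, 3), round(cross_tied_enum, 3))
```

Changing `p` shows how sensitive each configuration is to the reliability of a single cell.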


1.5.3 Exercises: Section 1.5 (79-100)

79. Reconsider the credit card scenario of Exercise 52, and show that A and B are
dependent first by using the definition of independence and then by verifying
that the multiplication property does not hold.

80. An oil exploration company currently has two active projects, one in Asia and
the other in Europe. Let A be the event that the Asian project is successful and
B be the event that the European project is successful. Suppose that A and
B are independent events with $P(A) = .4$ and $P(B) = .7$.

(a) If the Asian project is not successful, what is the probability that the
European project is also not successful? Explain your reasoning.
(b) What is the probability that at least one of the two projects will be
successful?
(c) Given that at least one of the two projects is successful, what is the
probability that only the Asian project is successful?

81. In Exercise 15, is any $A_i$ independent of any other $A_j$? Answer using the
multiplication property for independent events.

82. If A and B are independent events, show that $A'$ and $B$ are also independent.
[Hint: First use a Venn diagram to establish a relationship among $P(A' \cap B)$,
$P(B)$, and $P(A \cap B)$.]

83. Suppose that the proportions of blood phenotypes in a particular population
are as follows:

A      B      AB     O
.40    .11    .04    .45

Assuming that the phenotypes of two randomly selected individuals are
independent of each other, what is the probability that both phenotypes are O?
What is the probability that the phenotypes of two randomly selected
individuals match?

84. The probability that a grader will make a marking error on any particular
question of a multiple-choice exam is .1. If there are ten questions on the exam 
and questions are marked independently, what is the probability that no errors 
are made? That at least one error is made? If there are n questions on the exam 
and the probability of a marking error is p rather than .1, give expressions for 
these two probabilities. 

85. In October, 1994, a flaw in a certain Pentium chip installed in computers was
discovered that could result in a wrong answer when performing a division.
The manufacturer initially claimed that the chance of any particular division
being incorrect was only 1 in 9 billion, so that it would take thousands of years
before a typical user encountered a mistake. However, statisticians are not
typical users; some modern statistical techniques are so computationally
intensive that a billion divisions over a short time period is not unrealistic.
Assuming that the 1 in 9 billion figure is correct and that results of divisions
are independent from one another, what is the probability that at least one
error occurs in 1 billion divisions with this chip?

86. An aircraft seam requires 25 rivets. The seam will have to be reworked if any
of these rivets is defective. Suppose rivets are defective independently of one
another, each with the same probability.
(a) If 20% of all seams need reworking, what is the probability that a rivet is
defective?
(b) How small should the probability of a defective rivet be to ensure that
only 10% of all seams need reworking?

87. A boiler has five identical relief valves. The probability that any particular
valve will open on demand is .95. Assuming independent operation of the
valves, calculate P(at least one valve opens) and P(at least one valve fails
to open).

88. Two pumps connected in parallel fail independently of each other on any
given day. The probability that only the older pump will fail is .10, and the
probability that only the newer pump will fail is .05. What is the probability
that the pumping system will fail on any given day (which happens if both
pumps fail)?

89. Consider the system of components connected as in the accompanying
picture. Components 1 and 2 are connected in parallel, so that subsystem works
iff either 1 or 2 works; since 3 and 4 are connected in series, that subsystem
works iff both 3 and 4 work. If components work independently of one
another and P(component works) = .9, calculate P(system works).

[Figure: system diagram showing components 1 and 2 connected in parallel and
components 3 and 4 connected in series]


90. Refer back to the series-parallel system configuration introduced in Example
1.39, and suppose that there are only two cells rather than three in each
parallel subsystem [in Fig. 1.15a, eliminate cells 3 and 6, and renumber
cells 4 and 5 as 3 and 4]. Using $P(A_i) = .9$, the probability that system lifetime
exceeds $t_0$ is easily seen to be .9639. To what value would .9 have to be
changed in order to increase the system lifetime reliability from .9639 to .99?
[Hint: Let $P(A_i) = p$, express system reliability in terms of $p$, and then let
$x = p^2$.]

91. Consider independently rolling two fair dice, one red and the other green. Let
A be the event that the red die shows 3 dots, B be the event that the green die
shows 4 dots, and C be the event that the total number of dots showing on the
two dice is 7.

(a) Are these events pairwise independent (i.e., are A and B independent
events, are A and C independent, and are B and C independent)?
(b) Are the three events mutually independent?

92. Components arriving at a distributor are checked for defects by two different
inspectors (each component is checked by both inspectors). The first inspector
detects 90% of all defectives that are present, and the second inspector does
likewise. At least one inspector fails to detect a defect on 20% of all defective
components. What is the probability that the following occur?

(a) A defective component will be detected only by the first inspector? By
exactly one of the two inspectors?
(b) All three defective components in a batch escape detection by both
inspectors (assuming inspections of different components are independent
of one another)?

93. Seventy percent of all vehicles examined at a certain emissions inspection
station pass the inspection. Assuming that successive vehicles pass or fail
independently of one another, calculate the following probabilities:

(a) P(all of the next three vehicles inspected pass)
(b) P(at least one of the next three inspected fails)
(c) P(exactly one of the next three inspected passes)
(d) P(at most one of the next three vehicles inspected passes)
(e) Given that at least one of the next three vehicles passes inspection, what is
the probability that all three pass (a conditional probability)?

94. A quality control inspector is inspecting newly produced items for faults. The
inspector searches an item for faults in a series of independent fixations, each
of a fixed duration. Given that a flaw is actually present, let $p$ denote the
probability that the flaw is detected during any one fixation (this model is
discussed in “Human Performance in Sampling Inspection,” Human Factors,
1979: 99-105).

(a) Assuming that an item has a flaw, what is the probability that it is detected
by the end of the second fixation (once a flaw has been detected, the
sequence of fixations terminates)?
(b) Give an expression for the probability that a flaw will be detected by the
end of the $n$th fixation.
(c) If when a flaw has not been detected in three fixations, the item is passed,
what is the probability that a flawed item will pass inspection?
(d) Suppose 10% of all items contain a flaw [P(randomly chosen item is
flawed) = .1]. With the assumption of part (c), what is the probability that
a randomly chosen item will pass inspection (it will automatically pass if
it is not flawed, but could also pass if it is flawed)?
(e) Given that an item has passed inspection (no flaws in three fixations), what
is the probability that it is actually flawed? Calculate for $p = .5$.

95. (a) A lumber company has just taken delivery on a lot of 10,000 2 × 4 boards.
Suppose that 20% of these boards (2000) are actually too green to be
used in first-quality construction. Two boards are selected at random, one
after the other. Let A = {the first board is green} and B = {the second board
is green}. Compute $P(A)$, $P(B)$, and $P(A \cap B)$ (a tree diagram might help).
Are A and B independent?
(b) With A and B independent and $P(A) = P(B) = .2$, what is $P(A \cap B)$? How
much difference is there between this answer and $P(A \cap B)$ in part (a)? For
purposes of calculating $P(A \cap B)$, can we assume that A and B of part
(a) are independent to obtain essentially the correct probability?
(c) Suppose the lot consists of ten boards, of which two are green. Does the
assumption of independence now yield approximately the correct answer
for $P(A \cap B)$? What is the critical difference between the situation here
and that of part (a)? When do you think that an independence assumption
would be valid in obtaining an approximately correct answer to $P(A \cap B)$?

96. Refer to the assumptions stated in Exercise 89 and answer the question posed
there for the system in the accompanying picture. How would the probability
change if this were a subsystem connected in parallel to the subsystem
pictured in Fig. 1.15a?

97. Professor Stander Deviation can take one of two routes on his way home from
work. On the first route, there are four railroad crossings. The probability that
he will be stopped by a train at any particular one of the crossings is .1, and
trains operate independently at the four crossings. The other route is longer
but there are only two crossings, independent of each other, with the same
stoppage probability for each as on the first route. On a particular day,
Professor Deviation has a meeting scheduled at home for a certain time.
Whichever route he takes, he calculates that he will be late if he is stopped
by trains at at least half the crossings encountered.

(a) Which route should he take to minimize the probability of being late to the
meeting?
(b) If he tosses a fair coin to decide on a route and he is late, what is the
probability that he took the four-crossing route?

98. For a customer who test drives three vehicles, define events $A_i$ = customer
likes vehicle #$i$ for $i = 1, 2, 3$. Suppose that $P(A_1) = .55$, $P(A_2) = .65$,
$P(A_3) = .70$, $P(A_1 \cup A_2) = .80$, $P(A_2 \cap A_3) = .40$, and
$P(A_1 \cup A_2 \cup A_3) = .88$.

(a) What is the probability that a customer likes both vehicle #1 and vehicle
#2?
(b) Determine and interpret $P(A_2 \mid A_3)$.
(c) Are $A_2$ and $A_3$ independent events? Answer in two different ways.
(d) If you learn that the customer did not like vehicle #1, what now is the
probability that s/he liked at least one of the other two vehicles?

99. It’s a commonly held misconception that if you play the lottery $n$ times, and
the probability of winning each time is $1/N$, then your chance of winning at
least once is $n/N$. That’s true if you buy $n$ tickets in 1 week, but not if you buy
a single ticket in each of $n$ independent weeks. Let’s explore further.

(a) Suppose you play a game $n$ independent times, with P(win) = $1/N$ each
time. Find an expression for the probability you win at least once. [Hint:
Consider the complement.]

(b) How does your answer to (a) compare to $n/N$ for the easy task of rolling a
particular face on a fair die (so $1/N = 1/6$) in $n = 3$ tries? In $n = 6$ tries? In
$n = 10$ tries?

(c) How does your answer to (a) compare to $n/N$ in the setting of Exercise 85:
probability = 1 in 9 billion, number of tries = 1 billion?

(d) Show that when $n$ is much smaller than $N$, the fraction $n/N$ is not a bad
approximation to (a). [Hint: Use the binomial theorem from high school
algebra.]

100. Suppose identical tags are placed on both the left ear and the right 
ear of a fox. The fox is then let loose for a period of time. Consider 
the two events C₁ = {left ear tag is lost} and C₂ = {right ear tag is lost}.
Let p = P(C₁) = P(C₂), and assume C₁ and C₂ are independent events. Derive
an expression (involving p) for the probability that exactly one tag is lost, 
given that at most one is lost (“Ear Tag Loss in Red Foxes,” J. Wildlife 
Manag., 1976: 164-167). [Hint: Draw a tree diagram in which the two initial 
branches refer to whether the left ear tag was lost.] 


1.6 Simulation of Random Events 


As probability models in engineering and the sciences have grown in complexity, 
many problems have arisen that are too difficult to attack “analytically,” i.e., using
mathematical tools such as those in the previous sections. Instead, computer
simulation provides us an effective way to estimate probabilities of very complicated
events (and, in later chapters, of other properties of random phenomena).
Here we introduce the principles of probability simulation, demonstrate a few 
examples with Matlab and R code, and discuss the precision of simulated 
probabilities. 

Suppose an investigator wishes to determine P(A), but either the experiment on 
which A is defined or the A event itself is so complicated as to preclude the use of 
probability rules and properties. The general method for estimating this probability 
via simulation software is as follows: 


— Write a program that simulates (mimics) the underlying random experiment. 
— Run the program many times, with each run independent of all others. 
— During each run, record whether or not the event A of interest occurs. 


If the simulation is run a total of n independent times, then the estimate of P(A),
denoted by P̂(A), is

\[
\hat{P}(A) = \frac{\text{number of times } A \text{ occurs}}{\text{number of runs}} = \frac{n(A)}{n}
\]


For example, if we run a simulation program 10,000 times and the event of
interest A occurs in 6174 of those runs, then our estimate of P(A) is
P̂(A) = 6174/10,000 = .6174. Notice that our definition is consistent with the
long-run relative frequency interpretation of probability discussed in Sect. 1.2.
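
The three-step method above can be sketched in a few lines of code. What follows is a minimal illustration in Python (the figures in this section use Matlab and R instead). The names estimate_probability and roll_is_six, and the choice of a single die roll as the experiment, are assumptions made only for this sketch.

```python
import random

def estimate_probability(run_once, n, rng):
    """Run the experiment n independent times and return n(A)/n."""
    count = 0                          # number of runs in which A occurred
    for _ in range(n):
        if run_once(rng):              # run_once returns True when A occurs
            count += 1
    return count / n

def roll_is_six(rng):
    """Hypothetical experiment: roll one fair die; A = {the roll is a 6}."""
    return rng.randint(1, 6) == 6

rng = random.Random(1)                 # fixed seed so the run is reproducible
p_hat = estimate_probability(roll_is_six, 10_000, rng)
print(p_hat)                           # should land near 1/6
```

Any experiment can be plugged in by supplying a different run_once function that returns True exactly when the event of interest occurs.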


1.6.1 The Backbone of Simulation: Random Number Generators 


All modern software packages are equipped with a function called a random
number generator (RNG). A typical call to this function (such as ran or rand)
will return a single, supposedly “random” number, though such functions typically
permit the user to request a vector or even a matrix of “random” numbers. It is more
proper to call these results pseudo-random numbers, since there is actually a
deterministic (i.e., non-random) algorithm by which the software generates these
values. We will not discuss the details of such algorithms here; see the book by Law
listed in the references. What will matter to us are the following two characteristics:

1. Each number created by an RNG is as likely to be any particular number in the 
interval [0, 1) as it is to be any other number in this interval (up to computer 
precision, anyway).*

2. Successive values created by RNGs are independent, in the sense that we cannot 
predict the next value to be generated from the current value (unless we 
somehow know the exact parameters of the underlying algorithm). 

A typical simulation program manipulates numbers on the interval [0, 1) in a 
way that mimics the experiment of interest; several examples are provided below. 
Arguably the most important building block for such programs is the ability to 
simulate a basic event that occurs with a known probability, p. Since RNGs produce 
values equally likely to be anywhere in the interval [0, 1), it follows that in the long 
run a proportion p of them will lie in the interval [0, p). So, suppose we need to 
simulate an event B with P(B) =p. In each run of our simulation program, we can 
call for a single “random” number, which we’ll call u, and apply the following 
rules:


— If 0 ≤ u < p, then event B has occurred on this run of the program.
— If p ≤ u < 1, then event B has not occurred on this run of the program.
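
The two rules above amount to a single comparison. Here is a minimal Python sketch (rather than the Matlab and R used in the figures). The function name event_occurs and the value p = .3 are hypothetical choices for illustration.

```python
import random

def event_occurs(p, rng):
    """Simulate one run of an event B with P(B) = p."""
    u = rng.random()                   # pseudo-random number in [0, 1)
    return u < p                       # B occurs exactly when 0 <= u < p

rng = random.Random(2)                 # fixed seed so the run is reproducible
n = 10_000
count = sum(1 for _ in range(n) if event_occurs(0.3, rng))
proportion = count / n
print(proportion)                      # long-run proportion should be near .3
```

Because u is never equal to 1, an event with p = 1 always occurs and an event with p = 0 never does, as it should.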


* In the language of Chap. 3, the numbers produced by an RNG follow essentially a uniform
distribution on the interval [0, 1).
Example 1.40 Let’s begin with an example in which the exact probability can be
obtained analytically, so that we may verify that our simulation method works.
Suppose we have two independent devices which function with probabilities .6 and
.7, respectively. What is the probability both devices function? That at least one
device functions?

Let B₁ and B₂ denote the events that the first and second devices function,
respectively; we know that P(B₁) = .6, P(B₂) = .7, and B₁ and B₂ are independent.
Our first goal is to estimate the probability of A = B₁ ∩ B₂, the event that both
devices function. The following “pseudo-code” will allow us to find P̂(A).

0. Set a counter for the number of times A occurs to zero.

Repeat n times:

1. Generate two random numbers, u₁ and u₂. (These will help us determine whether
   B₁ and B₂ occur, respectively.)

2. If u₁ < .6 AND u₂ < .7, then A has occurred. Add 1 to the count of occurrences
   of A.


Once the n runs are complete, then P̂(A) = (count of the occurrences of A)/n.

Figure 1.16 shows actual implementation code in both Matlab and R. We ran
each program with n = 10,000 (as in the code); the event A occurred 4215 times in
Matlab and 4181 times in R, providing estimated probabilities of
P̂(A) = .4215 and .4181, respectively. Compare this to the exact probability of A:
by independence, P(A) = P(B₁)P(B₂) = (.6)(.7) = .42. Both of our simulation
estimates were “in the ballpark” of the right answer. We’ll discuss the precision
of these estimates shortly.

By replacing the “and” operators && in Fig. 1.16 with “or” operators ||, we can
estimate the probability at least one device functions, P(B₁ ∪ B₂). In one simulation
(again with n = 10,000), the event B₁ ∪ B₂ occurred 8,802 times, giving the estimate
P̂(B₁ ∪ B₂) = .8802. This is quite close to the exact probability:

\[
P(B_1 \cup B_2) = 1 - P(B_1' \cap B_2') = 1 - (1 - .6)(1 - .7) = .88
\]


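(a) Matlab

    A=0;                      % count of runs in which A occurs
    for i=1:10000
        u1=rand; u2=rand;     % one random number per device
        if u1<.6 && u2<.7     % both devices function
            A=A+1;
        end
    end

(b) R

    A<-0                              # count of runs in which A occurs
    for(i in 1:10000){
        u1<-runif(1); u2<-runif(1)    # one random number per device
        if(u1<.6 && u2<.7){           # both devices function
            A<-A+1
        }
    }


Fig. 1.16 Code for Example 1.40: (a) Matlab; (b) R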
Example 1.41 Consider the following game: You’ll flip a coin 25 times, winning
$1 each time it lands heads (H) and losing $1 each time it lands tails (T).
Unfortunately for you, the coin is biased in such a way that P(H) = .4 and P(T) = .6. What’s
the probability you come out ahead, i.e., you have more money at the end of the
game than you had at the beginning? We’ll use simulation to find out.

Now each run of the simulation requires 25 “random” objects: the results of the 
25 coin tosses. What’s more, we need to keep track of how much money we’ve won 
or lost at the end of the 25 tosses. Let A= {we come out ahead}, and use the 
following pseudo-code: 

0. Set a counter for the number of times A occurs to zero.

Repeat n times:

1. Set your initial dollar amount to zero.
2. Generate 25 random numbers u₁, ..., u₂₅.
3. For each uᵢ < .4, heads was tossed, so add 1 to your dollar amount. For each
   uᵢ ≥ .4, the flip was tails and 1 is deducted.
4. If the final dollar amount is positive (i.e., $1 or greater), add 1 to the count of
   occurrences for A.


Once the n runs are complete, then P̂(A) = (count of the occurrences of A)/n.
Matlab and R code for Example 1.41 appear in Fig. 1.17. Our R code gave a final
count of 1,567 occurrences of A, out of 10,000 runs. Thus, the estimated probability
that we come out ahead in this game is P̂(A) = 1567/10,000 = .1567.


(a) Matlab

    A=0;
    for i=1:10000
        dollar=0;                  % start each game with $0
        for j=1:25
            u=rand;
            if u<.4                % heads: win $1
                dollar=dollar+1;
            else                   % tails: lose $1
                dollar=dollar-1;
            end
        end
        if dollar>0                % came out ahead
            A=A+1;
        end
    end

(b) R

    A<-0
    for (i in 1:10000){
        dollar<-0                  # start each game with $0
        for (j in 1:25){
            u<-runif(1)
            if (u<.4){             # heads: win $1
                dollar<-dollar+1
            } else {               # tails: lose $1
                dollar<-dollar-1
            }
        }
        if (dollar>0){             # came out ahead
            A<-A+1
        }
    }

Fig. 1.17 Code for Example 1.41: (a) Matlab; (b) R


Throughout this textbook, we will illustrate repeated simulation through “for”
loops, as in Figs. 1.16 and 1.17. Though this isn’t necessarily the most efficient way
to code these examples, we do so for clarity’s sake. Readers familiar with basic
programming may realize that such operations can be sped up by vectorization, i.e.,
by using a function call that generates all the required random numbers
simultaneously, rather than one at a time. Similarly, the if/else statements used in the
preceding programs to determine whether a random number lies in an interval can
be rewritten in terms of true/false bits, which automatically generate a 1 if a
statement is true and a 0 otherwise. For example, the Matlab code

    if u<.5
        A=A+1;
    end

can be replaced by the single line of code 

A=A+(u<.5); 

If the statement in parentheses is true, Matlab assigns a value 1 to (u<.5), and
so 1 is added to the count A. Similar code works in R.
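
The same true/false trick carries over to other languages. As a hedged illustration in Python (not one of the two languages used in the figures), a comparison evaluates to True or False, and these behave as 1 and 0 in arithmetic. The variable names and the seed below are arbitrary choices for this sketch.

```python
import random

rng = random.Random(3)                 # fixed seed so the run is reproducible
n = 10_000

# Loop version: add the true/false result of the comparison to the count.
A = 0
for _ in range(n):
    u = rng.random()
    A += (u < 0.5)                     # True counts as 1, False as 0

# Compact version: generate all n comparisons and add them up in one call.
B = sum(rng.random() < 0.5 for _ in range(n))

print(A / n, B / n)                    # both proportions should be near .5
```

The compact version is in the spirit of the vectorization mentioned above: the loop disappears into a single call that totals all n true/false values.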

The previous two examples have both assumed independence of certain events: 
the functionality of neighboring devices, or the outcomes of successive coin flips. 
With the aid of some built-in packages within Matlab and R, we can also simulate 
counting experiments similar to those in Sect. 1.3, even though draws without 
replacement from a finite population are not independent. To illustrate, let’s use 
simulation to estimate some of the combinatorial probabilities from Sect. 1.3. 


Example 1.42 Consider again the situation presented in Example 1.24: A univer- 
sity warehouse has received a shipment of 25 printers, of which 10 are laser printers 
and 15 are inkjet models; a particular technician will check 6 of these 25 printers, 
selected at random. Of interest is the probability of the event D3 = {exactly 3 of the 
6 selected are inkjet printers}. Although the initial probability of selecting an inkjet 
printer is 15/25, successive selections are not independent (the conditional proba- 
bility that the next printer is also an inkjet is not 15/25). So, the method of the 
preceding examples does not apply. 

Instead, we use the sampling tool built into our software, as follows: 

0. Set a counter for the number of times D3 occurs to zero. 

Repeat n times: 

1. Sample 6 numbers, without replacement, from the integers 1 through 25. (1–15
correspond to the labels for the 15 inkjet printers and 16–25 identify the 10 laser
printers.)

2. Count how many of these 6 numbers fall between 1 and 15, inclusive.

3. If exactly 3 of these 6 numbers fall between 1 and 15, add 1 to the count of 
occurrences for D3. 

Once the n runs are complete, then P̂(D3) = (count of the occurrences of D3)/n.

Matlab and R code for this example appear in Fig. 1.18. Vital to the execution of
this simulation is the fact that both software packages have a built-in mechanism for
randomly sampling without replacement from a finite set of objects (here, the
integers 1–25). For more information on these functions, type help randsample
in Matlab or help(sample) in R.

In both sets of code, the line sum(printers<=15) performs two actions.
First, printers<=15 converts each of the 6 numbers in the vector printers
into a 1 if the entry is between 1 and 15 (and into a 0 otherwise). Second, sum( )
adds up the 1s and 0s, which is equivalent to identifying how many 1s appear (i.e.,
how many of the 6 numbers fell between 1 and 15).
(a) Matlab

    D=0;
    for i=1:10000
        printers=randsample(25,6);    % 6 labels, without replacement
        inkjet=sum(printers<=15);     % labels 1-15 are inkjet printers
        if inkjet==3
            D=D+1;
        end
    end

(b) R

    D<-0
    for (i in 1:10000){
        printers<-sample(25,6)        # 6 labels, without replacement
        inkjet<-sum(printers<=15)     # labels 1-15 are inkjet printers
        if (inkjet==3){
            D<-D+1
        }
    }


Fig. 1.18 Matlab and R code for Example 1.42


Our R code resulted in event D3 occurring 3054 times, so
P̂(D3) = 3054/10,000 = .3054, which is quite close to the “exact” answer of
.3083 found in Example 1.24. The other probability of interest, the chance of
randomly selecting at least 3 inkjet printers, can be estimated by modifying one
line of code: change inkjet==3 to inkjet>=3. One simulation provided a
count of 8522 occurrences in 10,000 trials, for an estimated probability of .8522
(close to the combinatorial solution of .8530).
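
For readers working in another language, the printer simulation can be sketched in Python as well, alongside the counting formula it is meant to approximate. This is an illustrative sketch only: the function name estimate_exactly_three_inkjets is hypothetical, and Python's random.sample stands in for the randsample and sample functions used in the Matlab and R code.

```python
import random
from math import comb

def estimate_exactly_three_inkjets(n, rng):
    """Estimate P(exactly 3 of the 6 sampled printers are inkjets)."""
    count = 0
    for _ in range(n):
        # Sample 6 distinct labels from 1..25 (sampling without replacement).
        printers = rng.sample(range(1, 26), 6)
        # Labels 1-15 stand for the inkjet printers.
        inkjet = sum(label <= 15 for label in printers)
        if inkjet == 3:
            count += 1
    return count / n

rng = random.Random(4)                 # fixed seed so the run is reproducible
p_hat = estimate_exactly_three_inkjets(10_000, rng)

# Exact answer by counting: choose 3 of 15 inkjets and 3 of 10 lasers.
exact = comb(15, 3) * comb(10, 3) / comb(25, 6)
print(p_hat, round(exact, 4))          # exact value rounds to 0.3083
```

Comparing the simulated value with the counting result is a useful sanity check on the sampling code before it is applied to problems with no closed-form answer.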


1.6.2 Precision of Simulation 


In Example 1.40, we gave two different estimates P̂(A) for a probability P(A).
Which is more “correct”? Without knowing P(A) itself, there’s no way to tell. 
However, thanks to the theory we will develop in subsequent chapters, we can 
quantify the precision of simulated probabilities. Of course, we must have written 
code that faithfully simulates the random experiment of interest. Further, we 
assume that the results of each single run of our program are independent of the 
results of all other runs. (This generally follows from the aforementioned indepen- 
dence of computer-generated random numbers.) 

If this is the case, then a measure of the disparity between the true probability P(A)
and the estimated probability P̂(A) based on n runs of the simulation is given by

\[
\sqrt{\frac{\hat{P}(A)\,[1 - \hat{P}(A)]}{n}} \tag{1.8}
\]


This measure of precision is called the (estimated) standard error of the
estimate P̂(A); see Sect. 2.4 for a derivation. Expression (1.8) tells us that the
amount by which P̂(A) typically differs from P(A) depends upon two values: P̂(A)
itself, and the number of runs n. You can make sense of the former this way: if P(A)
is very small, then P̂(A) will presumably be small as well, in which case they cannot
deviate by very much since both are bounded below by zero. (Standard error
quantifies the absolute difference between them, not the relative difference.) A
similar comment applies if P(A) is very large, i.e., near 1.

As for the relationship to n, Expression (1.8) indicates that the amount by which
P̂(A) will typically differ from P(A) is proportional to the reciprocal of the square
root of n. So, in particular, as n increases the tendency is for P̂(A) to vary less and
less. This speaks to the precision of P̂(A): our estimate becomes more precise as
n increases, but not at a very fast rate.

Let’s think a bit more about this relationship: suppose your simulation results 
thus far were too imprecise for your tastes. By how much would you have to 
increase the number of runs to gain one additional decimal place of precision? 
That’s equivalent to reducing the estimated standard error by a factor of 10. Since
precision is proportional to 1/√n, you would need to increase n by a factor of 100 to
achieve the desired improvement; e.g., if using n = 10,000 runs is insufficient for
your purposes, then you’ll need 1,000,000 runs to get one additional decimal place
of precision. Typically, this will mean running your program 100 times longer: not
a big deal if 10,000 runs only take a nanosecond but prohibitive if they require, say,
an hour.
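
The factor-of-100 rule can be checked numerically. The short Python sketch below is illustrative only: the helper names standard_error and runs_needed are hypothetical, and the estimate .25 is an arbitrary placeholder, since the ratio of the two standard errors does not depend on it.

```python
from math import sqrt

def standard_error(p_hat, n):
    """Estimated standard error from Expression (1.8)."""
    return sqrt(p_hat * (1 - p_hat) / n)

def runs_needed(current_runs, improvement_factor):
    """Runs required to shrink the standard error by improvement_factor."""
    # Standard error scales like 1/sqrt(n), so n must grow by the square.
    return current_runs * improvement_factor ** 2

p_hat = 0.25                           # arbitrary placeholder estimate
n_small = 10_000
n_large = runs_needed(n_small, 10)     # one more decimal place of precision
ratio = standard_error(p_hat, n_small) / standard_error(p_hat, n_large)
print(n_large, ratio)                  # 1,000,000 runs; ratio is 10 up to rounding
```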


Example 1.43 (Example 1.41 continued) Based on n = 10,000 runs, we estimated
the probability of coming out ahead in a certain game to be P̂(A) = .1567.
Substituting into Eq. (1.8), we get

\[
\sqrt{\frac{.1567\,[1 - .1567]}{10{,}000}} = .0036
\]


This is the (estimated) standard error of our estimate .1567. We interpret as 
follows: some simulation experiments with n= 10,000 will result in an estimated 
probability that is within .0036 of the actual probability, whereas other such 
experiments will give an estimated probability that deviates by more than .0036 
from the actual P(A); .0036 is roughly the size of a typical deviation between the
estimate and what it is estimating.
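
The arithmetic in this example is easy to reproduce. The following Python sketch (the function name estimated_standard_error is a hypothetical helper) evaluates Expression (1.8) for the estimate .1567, and then, as an extra check, for the two estimates .4215 and .4181 from Example 1.40, whose exact probability .42 is known.

```python
from math import sqrt

def estimated_standard_error(p_hat, n):
    """Expression (1.8): sqrt(p_hat * (1 - p_hat) / n)."""
    return sqrt(p_hat * (1 - p_hat) / n)

# Example 1.43: estimate .1567 based on 10,000 runs.
se_game = estimated_standard_error(0.1567, 10_000)
print(round(se_game, 4))               # 0.0036

# Example 1.40: both estimates of P(A) = .42 fall within one standard error.
within_one_se = []
for p_hat in (0.4215, 0.4181):
    se = estimated_standard_error(p_hat, 10_000)
    within_one_se.append(abs(p_hat - 0.42) <= se)
    print(p_hat, round(se, 4), within_one_se[-1])
```

That both Example 1.40 estimates land within one standard error of .42 is in keeping with reading the standard error as the size of a typical deviation, though a given simulation could just as well miss by more.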


In Chap. 5, we will return to the notion of standard error and develop a so-called
confidence interval estimate for P(A): a range of numbers within which we can be
quite confident that P(A) lies.


1.6.3 Exercises: Section 1.6 (101-120) 


101. Refer to Example 1.40. 

(a) Modify the code in Fig. 1.16 to estimate the probability that exactly one 
of the two devices functions properly. Then find the exact probability 
using the techniques from earlier sections of this chapter, and compare it 
to your estimated probability. 
(b) Calculate the estimated standard error for the estimated probability in (a).

102. Imagine you have five independently operating components, each working
properly with probability .8. Use simulation to estimate the probability that 
(a) All five components work properly. 

(b) At least one of the five components works properly. 

[Hints for (a) and (b): You can adapt the code from Example 1.40, but
the and/or statements will become tedious. Consider using the max and 
min functions instead. ] 

(c) Calculate the estimated standard errors for your answers in (a) and (b). 

103. Consider the system depicted in Exercise 96. Assume the seven components
operate independently with the following probabilities of functioning properly:
.9 for components 1 and 2; .8 for each of components 3, 4, 5, 6; and .95
for component 7. Write a program to estimate the reliability of the system
(i.e., the probability the system functions properly).

104. You have an opportunity to answer six trivia questions about your favorite
sports team, and you will win a pair of tickets to their next game if you can
correctly answer at least three of the questions. Write a simulation program
to estimate the chance you win the tickets under each of the following
assumptions.

(a) You have a 50-50 chance of getting any question right, independent of 
all others. 

(b) Being a true fan, you have a 75% chance of getting any question right, 
independent of all others. 

(c) The first three questions are fairly easy, so you have a .75 chance of 
getting each of those right. However, the last three questions are much 
harder, and you only have a .3 probability of correctly answering each of 
those. 

105. In the game “Now or Then” on the television show The Price is Right, the
contestant faces a wheel with six sectors. Each sector contains a grocery item
and a price, and the contestant must decide whether the price is “now” (i.e.,
the item’s price the day of the taping) or “then” (the price at some specified
past date, such as September 2003). The contestant wins a prize (bedroom
furniture, a Caribbean cruise, etc.) if s/he guesses correctly on three adjacent
sectors. That is, numbering the sectors 1-6 clockwise, correct guesses on
sectors 5, 6, and 1 win the prize, but correct guesses on sectors 5, 6, and 3 do not,
since the latter are not all adjacent. (The contestant gets to guess on all six
sectors, if need be.)

Write a simulation program to estimate the probability the contestant wins 
the prize, assuming her/his guesses are independent from item to item. 
Provide estimated probabilities under each of the following assumptions: 
(1) each guess is “wild” and thus has probability .5 of being correct, and 
(2) the contestant is a good shopper, with probability .8 of being correct on 
any item. 

106. Refer to the game in Example 1.41. Under the same settings as in that
example, estimate the probability the player is ahead at any time during
the 25 plays. [Hint: This occurs if the player’s dollar amount is positive at
any of the 25 steps in the loop. So, you will need to keep track of every value
of the dollar variable, not just the final result.]

107. Refer again to Example 1.41. Estimate the probability that the player
experiences a “swing” of at least $5 during the game. That is, estimate the
chance that the difference between the largest and smallest dollar amounts
during the game is at least 5. (This would happen, for instance, if the player
was at one point ahead at +$2 but later fell behind to −$3.)

108. Each of this book’s authors has a fair coin. Carlton tosses his coin repeatedly
until obtaining the sequence HTT. Devore tosses his coin until the sequence
HTH is obtained.

(a) Write a program to simulate Carlton’s coin tossing and, separately, 
Devore’s. Your program should keep track of the number of tosses 
each author requires on each simulation run to achieve his target 
sequence. 

(b) Estimate the probability that Devore obtains his sequence with fewer 
tosses than Carlton requires to obtain his sequence. 

109. There’s a 40-question multiple-choice exam we sometimes administer in our
lower-level statistics classes. The exam has a peculiar feature: 10 of the 
questions have two options, 13 have three options, 13 have four options, and 
the other 4 have five options. (FYI, this is completely real!) What is the 
probability that, purely by guessing, a student could get at least half of these 
questions correct? Write a simulation program to answer this question. 

110. Major League Baseball teams play a 162-game season, during which fans are
often excited by long winning streaks and frustrated by long losing streaks.
But how unusual are these streaks, really? How long a streak would you
expect if the team’s performance were independent from game to game?
Write a program that simulates a 162-game season, i.e., a string of
162 wins and losses, with P(win) = p for each game (the value of p to be
specified later). Use your program with at least 10,000 runs to answer the
following questions.

(a) Suppose you’re rooting for a “.500” team—that is, p=.5. What is the 
probability of observing a streak of at least five wins in a 162-game 
season? Estimate this probability with your program, and include a 
standard error. 

(b) Suppose instead your team is quite good: a .600 team overall, so p=.6. 
Intuitively, should the probability of a winning streak of at least five 
games be higher or lower? Explain. 

(c) Use your program with p=.6 to estimate the probability alluded to in 
(b). Is your answer higher or lower than (a)? Is that what you anticipated? 

111. A derangement of the numbers 1 through n is a permutation of all n of those
numbers such that none of them is in the “right place.” For example, 34251 is
a derangement of 1 through 5, but 24351 is not because 3 is in the 3rd
position. We will use simulation to estimate the number of derangements of
the numbers 1 through 12.

(a) Write a program that generates random permutations of the integers 1, 2,
..., 12. Your program should determine whether or not each permutation 
is a derangement. 

(b) Based on your program, estimate P(D), where D = {a permutation of 1–12
is a derangement}.

(c) From Sect. 1.3, we know the number of permutations of n items. (How 
many is that for n= 12?) Use this information and your answer to part 
(b) to estimate the number of derangements of the numbers 1 through 12.

[Hint for part (a): Use random sampling without replacement as in Exam- 
ple 1.42. Alternatively, the randperm command in Matlab can also be 
employed. ] 

112. The book’s Introduction discussed the famous Birthday Problem, which was
solved in Example 1.22 of Sect. 1.3. Now suppose you have 500 Facebook
friends. Make the same assumptions here as in the Birthday Problem.

(a) Write a program to estimate the probability that, on at least 1 day during
the year, Facebook tells you three (or more) of your friends share that 
birthday. Based on your answer, should you be surprised by this 
occurrence? 

(b) Write a program to estimate the probability that, on at least 1 day during 
the year, Facebook tells you five (or more) of your friends share that 
birthday. Based on your answer, should you be surprised by this 
occurrence? 

[Hint: Generate 500 birthdays with replacement, then determine whether 
any birthday occurs three or more times (five or more for part (b)). The 
table function in R or tabulate in Matlab may prove useful.] 

113. Consider the following game: you begin with $20. You flip a fair coin,
winning $10 if the coin lands heads and losing $10 if the coin lands tails. 
Play continues until you either go broke or have $100 (i.e., a net profit of 
$80). Write a simulation program to estimate: 

(a) The probability you win the game. 

(b) The probability the game ends within ten coin flips. 

[Note: This is a special case of the Gambler’s Ruin problem, which we’ll
explore in much greater depth in Exercise 145 and again in Chap. 6.]

114. Consider the Coupon Collector’s Problem described in the Introduction:
10 different coupons are distributed into cereal boxes, one per box, so that 
any randomly selected box is equally likely to have any of the 10 coupons 
inside. Write a program to simulate the process of buying cereal boxes until 
all 10 distinct coupons have been collected. For each run, keep track of how 
many cereal boxes you purchased to collect the complete set of coupons. 
Then use your program to answer the following questions. 

(a) What is the probability you collect all 10 coupons with just 10 cereal 
boxes? 

(b) Use counting techniques to determine the exact probability in (a). [Hint: 
Relate this to the Birthday Problem. ] 
(c) What is the probability you require more than 20 boxes to collect all
10 coupons? 

(d) Using techniques from Chap. 4, it can be shown that it takes about 29.3 
boxes, on the average, to collect all 10 coupons. What’s the probability of 
collecting all 10 coupons in fewer than average boxes (i.e., less than 
29.3)? 

115. In the Introduction we mentioned a famous puzzle from the early days of
probability, investigated by Pascal and Fermat. Which of the following
events is more likely: to roll at least one 6 in four rolls of a fair die, or to
roll at least one double-6 in 24 rolls of two fair dice?

(a) Write a program to simulate a set of four die rolls many times, and use
the results to estimate P(at least one 6 in 4 rolls).

(b) Now adapt your program to simulate rolling a pair of dice 24 times.
Repeat this simulation many times, and use your results to estimate
P(at least one double-6 in 24 rolls).

116. The Problem of the Points. Pascal and Fermat also explored a question
concerning how to divide the stakes in a game that has been interrupted.
Suppose two players, Blaise and Pierre, are playing a game where the winner
is the first to achieve a certain number of points. The game gets interrupted at
a moment when Blaise needs n more points to win and Pierre needs m more
to win. How should the game’s prize money be divvied up? Fermat argued
that Blaise should receive a proportion of the total stake equal to the chance
he would have won if the game hadn’t been interrupted (and Pierre receives
the remainder).

Assume the game is played in rounds, the winner of each round gets
1 point, rounds are independent, and the two players are equally likely to
win any particular round.

(a) Write a program to simulate the rounds of the game that would have 
happened after play was interrupted. A single simulation run should 
terminate as soon as Blaise has n wins or Pierre has m wins (equivalently, 
Blaise has m losses). Use your program to estimate P(Blaise gets 10 wins 
before 15 losses), which is the proportion of the total stake Blaise should 
receive if n= 10 and m= 15. 

(b) Use your same program to estimate the relevant probability when 
n=m= 10. Logically, what should the answer be? Is your estimated 
probability close to that? 

(c) Finally, let’s assume Pierre is actually the better player: P(Blaise wins a 
round) = .4. Again with n = 10 and m= 15, what proportion of the stake 
should be awarded to Blaise? 

117. Twenty faculty members in a certain department have just participated in a
department chair election. Suppose that candidate A has received 12 of the 
votes and candidate B the other 8 votes. If the ballots are opened one by one 
in random order and the candidate selected on each ballot is recorded, use 
simulation to estimate the probability that candidate A remains ahead of 
candidate B throughout the vote count (which happens if, for example, the 
result is AA...AB...B but not if the result is AABABB...). 
118. Show that the (estimated) standard error for P̂(A) is at most 1/√(4n).

119. Simulation can be used to estimate numerical constants, such as π. Here’s
one approach: consider the part of a disk of radius 1 that lies in the first
quadrant (a quarter-circle). Imagine two random numbers, x and y, both
between 0 and 1. The pair (x, y) lies somewhere in the first quadrant; let
A denote the event that (x, y) falls inside the quarter-circle.

(a) Write a program that simulates pairs (x, y) in order to estimate P(A), the
probability that a randomly selected point in the square [0, 1] ×
[0, 1] lies in the quarter-circle of radius 1.

(b) Using techniques from Chap. 4, it can be shown that the exact probability
of A is π/4 (which makes sense, because that’s the ratio of the
quarter-circle’s area to the square’s area). Use that fact to come up with an
estimate of π from your simulation. How close is your estimate to
3.14159...?

120. Consider the quadratic equation ax² + bx + c = 0. Suppose that a, b, and c are
random numbers between 0 and 1 (like those produced by an RNG). Estimate
the probability that the roots of this quadratic equation are real. [Hint:
Think about the discriminant.] This probability can be computed exactly
using methods from Chap. 4, but a triple integral is required.
121. A small manufacturing company will start operating a night shift. There are
20 machinists employed by the company.

(a) If a night crew consists of 3 machinists, how many different crews are 
possible? 

(b) If the machinists are ranked 1, 2, ..., 20 in order of competence, how 
many of these crews would not have the best machinist? 

(c) How many of the crews would have at least 1 of the 10 best machinists?

(d) If one of these crews is selected at random to work on a particular night, 
what is the probability that the best machinist will not work that night? 

122. A factory uses three production lines to manufacture cans of a certain type.
The accompanying table gives percentages of nonconforming cans,
categorized by type of nonconformance, for each of the three lines during
a particular time period.


                    Line 1   Line 2   Line 3
Blemish               15       12       20
Crack                 50       44       40
Pull-Tab Problem      21       28       24
Surface Defect        10        8       15
Other                  4        8        2
During this period, line 1 produced 500 nonconforming cans, line 2 produced
400 such cans, and line 3 was responsible for 600 nonconforming
cans. Suppose that one of these 1,500 cans is randomly selected.

(a) What is the probability that the can was produced by line 1? That the 
reason for nonconformance is a crack? 

(b) If the selected can came from line 1, what is the probability that it had a 
blemish? 

(c) Given that the selected can had a surface defect, what is the probability 
that it came from line 1? 

123. An employee of the records office at a university currently has ten forms on
his desk awaiting processing. Six of these are withdrawal petitions and the
other four are course substitution requests.

(a) If he randomly selects six of these forms to give to a subordinate, what is 
the probability that only one of the two types of forms remains on his 
desk? 

(b) Suppose he has time to process only four of these forms before leaving 
for the day. If these four are randomly selected one by one, what is the 
probability that each succeeding form is of a different type from its 
predecessor? 

One satellite is scheduled to be launched from Cape Canaveral in Florida, 

and another launching is scheduled for Vandenberg Air Force Base in 

California. Let A denote the event that the Vandenberg launch goes off on 

schedule, and let B represent the event that the Cape Canaveral launch goes 

off on schedule. If A and B are independent events with P(A) > P(B), 

P(A ∪ B) = .626, and P(A ∩ B) = .144, determine the values of P(A) and P(B).

A transmitter is sending a message by using a binary code, namely, a 

sequence of 0s and 1s. Each transmitted bit (0 or 1) must pass through

three relays to reach the receiver. At each relay, the probability is .20 that 
the bit sent will be different from the bit received (a reversal). Assume that 
the relays operate independently of one another. 

Transmitter → Relay 1 → Relay 2 → Relay 3 → Receiver

(a) If a 1 is sent from the transmitter, what is the probability that a 1 is sent
by all three relays?

(b) If a 1 is sent from the transmitter, what is the probability that a 1 is 
received by the receiver? [Hint: The eight experimental outcomes can be 
displayed on a tree diagram with three generations of branches, one 
generation for each relay.] 

(c) Suppose 70% of all bits sent from the transmitter are 1s. If a 1 is received
by the receiver, what is the probability that a 1 was sent? 

Individual A has a circle of five close friends (B, C, D, E, and F). A has heard 

a certain rumor from outside the circle and has invited the five friends to a 

party to circulate the rumor. To begin, A selects one of the five at random and 

tells the rumor to the chosen individual. That individual then selects at 
random one of the four remaining individuals and repeats the rumor. 

Continuing, a new individual is selected from those not already having 
heard the rumor by the individual who has just heard it, until everyone has

been told. 

(a) What is the probability that the rumor is repeated in the order B, C, D, E, 
and F? 

(b) What is the probability that F is the third person at the party to be told the 
rumor? 

(c) What is the probability that F is the last person to hear the rumor? 

Refer to the previous exercise. If at each stage the person who currently 

“has” the rumor does not know who has already heard it and selects the next 

recipient at random from all five possible individuals, what is the probability 

that F has still not heard the rumor after it has been told ten times at the 

party? 

According to the article “Optimization of Distribution Parameters for 

Estimating Probability of Crack Detection” (J. of Aircraft, 2009: 2090- 

2097), the following “Palmberg” equation is commonly used to determine 

the probability P_d(c) of detecting a crack of size c in an aircraft structure:


\[ P_d(c) = \frac{(c/c^*)^{\beta}}{1 + (c/c^*)^{\beta}} \]


where c* is the crack size that corresponds to a .5 detection probability (and 
thus is an assessment of the quality of the inspection process). 

(a) Verify that P_d(c*) = .5.

(b) What is P_d(2c*) when β = 4?

(c) Suppose an inspector inspects two different panels, one with a crack size
of c* and the other with a crack size of 2c*. Again assuming β = 4 and
also that the results of the two inspections are independent of one
another, what is the probability that exactly one of the two cracks will
be detected?

(d) What happens to P_d(c) as β → ∞?

A sonnet is a 14 line poem in which certain rhyming patterns are followed. 

The writer Raymond Queneau published a book containing just 10 sonnets, 

each on a different page. However, these were such that the first line of a 

sonnet could come from the first line on any of the 10 pages, the second line 

could come from the second line on any of the ten pages, and so on 

(successive lines were perforated for this purpose). 

(a) How many sonnets can be created from the 10 in the book? 

(b) If one of the sonnets counted in (a) is selected at random, what is the 
probability that all 14 lines come from exactly two of the ten pages? 

A chemical engineer is interested in determining whether a certain trace 
impurity is present in a product. An experiment has a probability of .80 of 
detecting the impurity if it is present. The probability of not detecting the 
impurity if it is absent is .90. The prior probabilities of the impurity being 
present and being absent are .40 and .60, respectively. Three separate 
experiments result in only two detections. What is the posterior probability

that the impurity is present? 

Fasteners used in aircraft manufacturing are slightly crimped so that they lock 

enough to avoid loosening during vibration. Suppose that 95% of all fasteners 

pass an initial inspection. Of the 5% that fail, 20% are so seriously defective 

that they must be scrapped. The remaining fasteners are sent to a recrimping 

operation, where 40% cannot be salvaged and are discarded. The other 60% of 

these fasteners are corrected by the recrimping process and subsequently pass 

inspection. 

(a) What is the probability that a randomly selected incoming fastener will 
pass inspection either initially or after recrimping? 

(b) Given that a fastener passed inspection, what is the probability that it 
passed the initial inspection and did not need recrimping? 

One percent of all individuals in a certain population are carriers of a 

particular disease. A diagnostic test for this disease has a 90% detection rate 

for carriers and a 5% detection rate for noncarriers. Suppose the test is applied 

independently to two different blood samples from the same randomly 

selected individual. 

(a) What is the probability that both tests yield the same result? 

(b) If both tests are positive, what is the probability that the selected individ- 
ual is a carrier? 

A system consists of two components. The probability that the second com- 

ponent functions in a satisfactory manner during its design life is .9, the 

probability that at least one of the two components does so is .96, and the 

probability that both components do so is .75. Given that the first component 

functions in a satisfactory manner throughout its design life, what is the 

probability that the second one does also? 

A certain company sends 40% of its overnight mail parcels via express mail 

service E1. Of these parcels, 2% arrive after the guaranteed delivery time

(denote the event “late delivery” by L). If a record of an overnight mailing is 

randomly selected from the company’s file, what is the probability that the 

parcel went via E1 and was late?

Refer to the previous exercise. Suppose that 50% of the overnight parcels are 

sent via express mail service E2 and the remaining 10% are sent via E3. Of

those sent via E2, only 1% arrive late, whereas 5% of the parcels handled by

E3 arrive late.

(a) What is the probability that a randomly selected parcel arrived late? 

(b) If a randomly selected parcel has arrived on time, what is the probability 
that it was not sent via E1?

A company uses three different assembly lines (A1, A2, and A3) to manufacture

a particular component. Of those manufactured by line A1, 5% need

rework to remedy a defect, whereas 8% of A2’s components need rework

and 10% of A3’s need rework. Suppose that 50% of all components are

produced by line A1, 30% are produced by line A2, and 20% come from line


A3. If a randomly selected component needs rework, what is the probability
that it came from line A1? From line A2? From line A3?

Disregarding the possibility of a February 29 birthday, suppose a randomly 
selected individual is equally likely to have been born on any one of the other 
365 days. If ten people are randomly selected, what is the probability that 
either at least two have the same birthday or at least two have the same last 
three digits of their Social Security numbers? [Note: The article “Methods for 
Studying Coincidences” (F. Mosteller and P. Diaconis, J. Amer. Statist. 
Assoc., 1989: 853-861) discusses problems of this type.] 

One method used to distinguish between granitic (G) and basaltic (B) rocks is 
to examine a portion of the infrared spectrum of the sun’s energy reflected 
from the rock surface. Let R1, R2, and R3 denote measured spectrum
intensities at three different wavelengths; typically, for granite R1 < R2 < R3,
whereas for basalt R3 < R1 < R2. When measurements are made remotely
(using aircraft), various orderings of the R_i's may arise whether the rock is
basalt or granite. Flights over regions of known composition have yielded the 
following information: 


                  Granite   Basalt
R1 < R2 < R3        60%       10%
R1 < R3 < R2        25%       20%
R3 < R1 < R2        15%       10%


Suppose that for a randomly selected rock in a certain region, 

P(granite) = .25 and P(basalt) = .75. 

(a) Show that P(granite | R1 < R2 < R3) > P(basalt | R1 < R2 < R3). If
measurements yielded R1 < R2 < R3, would you classify the rock as granite
or basalt?

(b) If measurements yielded R1 < R3 < R2, how would you classify the rock?
Answer the same question for R3 < R1 < R2.

(c) Using the classification rules indicated in parts (a) and (b), when selecting 
a rock from this region, what is the probability of an erroneous classifica- 
tion? [Hint: Either G could be classified as B or B as G, and P(B) and P(G) 
are known.]

(d) If P(granite) =p rather than .25, are there values of p (other than 1) for 
which a rock would always be classified as granite? 

In a Little League baseball game, team A’s pitcher throws a strike 50% of the 

time and a ball 50% of the time, successive pitches are independent of each 

other, and the pitcher never hits a batter. Knowing this, team B’s manager has 
instructed the first batter not to swing at anything. Calculate the probability 
that 

(a) The batter walks on the fourth pitch. 

(b) The batter walks on the sixth pitch (so two of the first five must be strikes), 
using a counting argument or constructing a tree diagram. 

(c) The batter walks. 
(d) The first batter up scores while no one is out (assuming that each batter
pursues a no-swing strategy).

Consider a woman whose brother is afflicted with hemophilia, which implies
that the woman’s mother has the hemophilia gene on one of her two X 
chromosomes (almost surely not both, since that is generally fatal). Thus 
there is a 50-50 chance that the woman’s mother has passed on the bad 
gene to her. The woman has two sons, each of whom will independently 
inherit the gene from one of her two chromosomes. If the woman herself has a 
bad gene, there is a 50-50 chance she will pass this on to a son. Suppose that 
neither of her two sons is afflicted with hemophilia. What then is the proba- 
bility that the woman is indeed the carrier of the hemophilia gene? What is this 
probability if she has a third son who is also not afflicted?

A particular airline has 10 a.m. flights from Chicago to New York, Atlanta,
and Los Angeles. Let A denote the event that the New York flight is full
and define events B and C analogously for the other two flights. Suppose
P(A) = .6, P(B) = .5, P(C) = .4 and the three events are independent. What is
the probability that 
(a) All three flights are full? That at least one flight is not full? 
(b) Only the New York flight is full? That exactly one of the three flights is 
full?

Consider four independent events A1, A2, A3, and A4 and let p_i = P(A_i) for
i = 1, 2, 3, 4. Express the probability that at least one of these four events
occurs in terms of the p_i's, and do the same for the probability that at least two
of the events occur.

A box contains the following four slips of paper, each having exactly the same
dimensions: (1) win prize 1; (2) win prize 2; (3) win prize 3; (4) win prizes
1, 2, and 3. One slip will be randomly selected. Let A1 = {win prize 1},
A2 = {win prize 2}, and A3 = {win prize 3}. Show that A1 and A2 are independent,
that A1 and A3 are independent, and that A2 and A3 are also independent
(this is pairwise independence). However, show that P(A1 ∩ A2 ∩ A3) ≠
P(A1)·P(A2)·P(A3), so the three events are not mutually independent.

Jurors may be a priori biased for or against the prosecution in a criminal trial.
Each juror is questioned by both the prosecution and the defense (the voir dire 
process), but this may not reveal bias. Even if bias is revealed, the judge may 
not excuse the juror for cause because of the narrow legal definition of bias. 
For a randomly selected candidate for the jury, define events B0, B1, and B2 as
the juror being unbiased, biased against the prosecution, and biased against
the defense, respectively. Also let C be the event that bias is revealed during
the questioning and D be the event that the juror is eliminated for cause.
Let b_i = P(B_i) (i = 0, 1, 2), c = P(C | B1) = P(C | B2), and d = P(D | B1 ∩ C) =
P(D | B2 ∩ C) [“Fair Number of Peremptory Challenges in Jury Trials,”
J. Amer. Statist. Assoc., 1979: 747-753].

(a) If a juror survives the voir dire process, what is the probability that he/she
is unbiased (in terms of the b_i's, c, and d)? What is the probability that
he/she is biased against the prosecution? What is the probability that
he/she is biased against the defense? [Hint: Represent this situation using
a tree diagram with three generations of branches.]

(b) What are the probabilities requested in (a) if b0 = .50, b1 = .10, b2 = .40
(all based on data relating to the famous trial of the Florida murderer Ted 
Bundy), c=.85 (corresponding to the extensive questioning appropriate 
in a capital case), and d=.7 (a “moderate” judge)? 

Gambler’s Ruin. Allan and Beth currently have $2 and $3, respectively. A fair

coin is tossed. If the result of the toss is heads, Allan wins $1 from Beth,

whereas if the coin toss results in tails, then Beth wins $1 from Allan. This
process is then repeated, with a coin toss followed by the exchange of $1, until
one of the two players goes broke (one of the two gamblers is ruined). We
wish to determine a2 = P(Allan is the winner | he starts with $2). To do so,

let’s also consider a_i = P(Allan wins | he starts with $i) for i = 0, 1, 3, 4, and 5.

(a) What are the values of a0 and a5?

(b) Use the Law of Total Probability to obtain an equation relating a2 to a1
and a3. [Hint: Condition on the result of the first coin toss, realizing that if
it is heads, then from that point Allan starts with $3.]

(c) Using the logic described in (b), develop a system of equations relating
a_i (i = 1, 2, 3, 4) to a_{i−1} and a_{i+1}. Then solve these equations. [Hint: Write
each equation so that a_i − a_{i−1} is on the left-hand side. Then use the result
of the first equation to express each other a_i − a_{i−1} as a function of a1, and
add together all four of these expressions (i = 2, 3, 4, 5).]

(d) Generalize the result to the situation in which Allan’s initial fortune
is $a and Beth’s is $b. [Note: The solution is a bit more complicated if
p = P(Allan wins $1) ≠ .5. We'll explore Gambler’s Ruin again in
Chap. 6.]
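
An analytic answer to the Gambler's Ruin exercise can be checked by simulation. The following Python sketch plays the game many times and reports the fraction of games that Allan wins. The function names, the seed, and the number of trials are illustrative choices and are not part of the exercise; the parameter p allows an unfair coin as in the note to part (d).

```python
import random

def allan_wins(start, total, p, rng):
    """Play one game to completion; return True if Allan ends with all the money."""
    fortune = start
    while 0 < fortune < total:
        # Heads (probability p): Allan gains $1; tails: Allan loses $1.
        fortune += 1 if rng.random() < p else -1
    return fortune == total

def estimate_win_probability(start=2, total=5, p=0.5, trials=100_000, seed=1):
    """Estimate P(Allan wins | he starts with $start) by repeated play."""
    rng = random.Random(seed)
    wins = sum(allan_wins(start, total, p, rng) for _ in range(trials))
    return wins / trials

# Allan starts with $2 and Beth with $3, so $5 is in play.
print(estimate_win_probability(start=2, total=5))
```

Changing start and total gives a numerical check on the general formula requested in part (d).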

The Matching Problem. Four friends—Allison, Beth, Carol, and Diane—who 

have identical calculators are studying for a statistics exam. They set their 

calculators down in a pile before taking a study break and then pick them up in 
random order when they return from the break. 

(a) What is the probability all four friends pick up the correct calculator? 

(b) What is the probability that at least one of the four gets her own calculator?
[Hint: Let A be the event that Allison gets her own calculator, and
define events B, C, and D analogously for the other three students. How 
can the event {at least one gets her own calculator} be expressed in terms 
of the four events A, B, C, and D? Now use a general law of probability. ] 

(c) Generalize the answer from part (b) to n individuals. Can you recognize
the result when n is large (the approximation to the resulting series)? 
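
For small n, the Matching Problem can also be settled by brute force, which gives a useful check on the series obtained in part (c). The sketch below enumerates all n! equally likely pick-up orders and counts those in which at least one person gets her own calculator. The function name is hypothetical, and exhaustive enumeration is practical only for small n.

```python
from fractions import Fraction
from itertools import permutations

def prob_at_least_one_match(n):
    """Exact P(at least one of n people picks up her own item), by enumeration."""
    total = 0
    with_match = 0
    for perm in permutations(range(n)):
        total += 1
        # Person i gets her own item when position i of the permutation holds i.
        if any(perm[i] == i for i in range(n)):
            with_match += 1
    return Fraction(with_match, total)

for n in range(1, 8):
    p = prob_at_least_one_match(n)
    print(n, p, float(p))
```

Watching how quickly the decimal values settle down as n grows may suggest the limiting value asked about in part (c).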

An event A is said to attract event B if P(B | A) > P(B) and repel B if P(B | A) <

P(B). (This refines the notion of dependent events by specifying whether 

A makes B more likely or less likely to occur.) 

(a) Show that if A attracts B, then A repels B’. 

(b) Show that if A attracts B, then A’ repels B. 

(c) Prove the Law of Mutual Attraction: event A attracts event B if, and only 
if, B attracts A.

The Secretary Problem. A personnel manager is to interview four candidates
for a job. These are ranked 1, 2, 3, and 4 in order of preference and will be 
interviewed in random order. However, at the conclusion of each interview, 
the manager will know only how the current candidate compares to those 
previously interviewed. For example, the interview order 3, 4, 1, 2 generates 
no information after the first interview, shows that the second candidate is 
worse than the first, and that the third is better than the first two. However, the 
order 3, 4, 2, 1 would generate the same information after each of the first 
three interviews. The manager wants to hire the best candidate but must make 
an irrevocable hire/no hire decision after each interview. Consider the follow- 
ing strategy: Automatically reject the first s candidates and then hire the first 
subsequent candidate who is best among those already interviewed (if no such 
candidate appears, the last one interviewed is hired). 

For example, with s= 2, the order 3, 4, 1, 2 would result in the best being 
hired, whereas the order 3, 1, 2, 4 would not. Of the four possible s values 
(0, 1, 2, and 3), which one maximizes P(best is hired)? [Hint: Write out the 
24 equally likely interview orderings: s=0 means that the first candidate is 
automatically hired.] 
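
The hint suggests listing all 24 interview orderings by hand; the same enumeration can be sketched in Python. In the code below, an ordering is a tuple of ranks in interview order (1 = best), and the strategy rejects the first s candidates and then hires the first later candidate who is best so far. The function names are illustrative, not part of the exercise.

```python
from fractions import Fraction
from itertools import permutations

def hired(order, s):
    """Rank of the candidate hired when the first s candidates are rejected."""
    for k in range(s, len(order)):
        # Hire the first candidate after the initial s who beats everyone seen so far.
        if order[k] == min(order[:k + 1]):
            return order[k]
    return order[-1]  # no such candidate appeared, so the last one is hired

def prob_best_hired(n, s):
    """Exact P(best is hired) over all n! equally likely interview orders."""
    orders = list(permutations(range(1, n + 1)))
    successes = sum(hired(order, s) == 1 for order in orders)
    return Fraction(successes, len(orders))

for s in range(4):
    print(s, prob_best_hired(4, s))
```

As a check against the text, hired((3, 4, 1, 2), 2) returns 1 (the best is hired), whereas hired((3, 1, 2, 4), 2) returns 4.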

Jay and Maurice are playing a tennis match. In one particular game, they have 

reached deuce, which means each player won three points. Now in order to 

finish the game, one of the two players must get two points ahead of the other. 

For example, Jay will win if he wins the next two points (JJ), or if Maurice 

wins the next point and Jay the three points after that (MJJJ), or if the result of

the next six points is JMMJJJ, etc.

(a) Suppose that the probability of Jay winning a point is .6 and outcomes of 
successive points are independent of one another. What is the probability 
that Jay wins the game? [Hint: In the law of total probability, let A1 = {Jay
wins each of the next two points}, A2 = {Maurice wins each of the next
two points}, and A3 = {each player wins one of the next two points}. Also
let p = P(Jay wins the game). How does p compare to P(Jay wins the
game | A3)?]

(b) If Jay wins the game, what is the probability that he needed only two 
points to do so? 

Here is a variant on one of the puzzles mentioned in the book’s Introduction. 

A fair coin is tossed repeatedly until either the sequence TTH or the sequence 

THT is observed. Let B be the event that stopping occurs because TTH was 

observed (i.e., that TTH is observed before THT). Calculate P(B). [Hint: 

Consider the following partition of the sample space: A1 = {1st toss is H},

A2 = {1st two tosses are TT}, A3 = {1st three tosses are THT}, and A4 = {1st

three tosses are THH}. Also denote P(B) by p. Apply the Law of Total 

Probability, and p will appear on both sides in various places. The resulting 

equation is easily solved for p.] 
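
Once the equation for p has been solved, the answer can be compared with a simulated estimate. The sketch below tosses a fair coin until the last three tosses read TTH or THT and records which pattern appeared first. The function names, the seed, and the number of trials are arbitrary illustrative choices.

```python
import random

def tth_before_tht(rng):
    """Toss a fair coin until TTH or THT appears; return True if TTH came first."""
    last3 = ""
    while True:
        # Keep only the three most recent tosses.
        last3 = (last3 + rng.choice("HT"))[-3:]
        if last3 == "TTH":
            return True
        if last3 == "THT":
            return False

def estimate_p(trials=50_000, seed=1):
    """Estimate P(B), the probability that TTH is observed before THT."""
    rng = random.Random(seed)
    return sum(tth_before_tht(rng) for _ in range(trials)) / trials

print(estimate_p())
```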


2 Discrete Random Variables and Probability Distributions


Suppose a city’s traffic engineering department monitors a certain intersection 
during a one-hour period in the middle of the day. Many characteristics might be 
of interest to the observers, including the number of vehicles that enter the inter- 
section, the largest number of vehicles in the left turn lane during a signal cycle, the 
speed of the fastest vehicle going through the intersection, the average speed of all 
vehicles entering the intersection. The value of each one of the foregoing variable 
quantities is subject to uncertainty—we don’t know a priori how many vehicles will 
enter, what the maximum speed will be, etc. So each of these is referred to as a 
random variable—a variable quantity whose value is determined by what happens 
in a chance experiment. 

There are two fundamentally different types of random variables, discrete and 
continuous. In this chapter we examine the basic properties and introduce the most 
important examples of discrete random variables. Chapter 3 covers the same 
territory for continuous random variables. 


2.1 Random Variables 


In any experiment, numerous characteristics can be observed or measured, but in 
most cases an experimenter will focus on some specific aspect or aspects of a 
sample. For example, in a study of commuting patterns in a metropolitan area, each 
individual in a sample might be asked about commuting distance and the number of 
people commuting in the same vehicle, but not about IQ, income, family size, and 
other such characteristics. Alternatively, a researcher may test a sample of 
components and record only the number that have failed within 1000 hours, rather 
than record the individual failure times. 

In general, each outcome of an experiment can be associated with a number by
specifying a rule of association (e.g., the number among the sample of ten
components that fail to last 1,000 h or the total weight of baggage for a sample of
25 airline passengers). Such a rule of association is called a random variable: a
variable because different numerical values are possible and random because the
observed value depends on which of the possible experimental outcomes results
(Fig. 2.1).

Fig. 2.1 A random variable


DEFINITION 

For a given sample space 𝒮 of some experiment, a random variable (rv) is
any rule that associates a number with each outcome in 𝒮. In mathematical
language, a random variable is a function whose domain is the sample space 
and whose range is some subset of real numbers. 


Random variables are customarily denoted by uppercase letters, such as X and Y, 
near the end of our alphabet. We will use lowercase letters to represent some 
particular value of the corresponding random variable. The notation X(s) = x 
means that x is the value associated with the outcome s by the rv X. 


Example 2.1 When a student attempts to connect to a university computer system, 
either there is a failure (F) or there is a success (S). With 𝒮 = {S, F}, define an rv
X by X(S) = 1, X(F) = 0. The rv X indicates whether (1) or not (0) the student can
connect. ■


In Example 2.1, the rv X was specified by explicitly listing each element of 𝒮 and
the associated number. If 𝒮 contains more than a few outcomes, such a listing is
tedious, but it can frequently be avoided. 


Example 2.2 Consider the experiment in which a telephone number in a certain 
area code is dialed using a random number dialer (such devices are used extensively 
by polling organizations), and define an rv Y by 


\[ Y = \begin{cases} 1 & \text{if the selected number is unlisted} \\ 0 & \text{if the selected number is listed in the directory} \end{cases} \]


For example, if 5282966 appears in the telephone directory, then Y(5282966) = 0,
whereas Y(7727350) = 1 tells us that the number 7727350 is unlisted. A word
description of this sort is more economical than a complete listing, so we will use
such a description whenever possible. ■
In Examples 2.1 and 2.2, the only possible values of the random variable were
0 and 1. Such a random variable arises frequently enough to be given a special 
name, after the individual who first studied it. 


DEFINITION 
Any random variable whose only possible values are 0 and 1 is called a
Bernoulli random variable. 


We will often want to define and study several different random variables from 
the same sample space. 


Example 2.3 Example 1.3 described an experiment in which the number of pumps
in use at each of two gas stations was determined. Define rvs X, Y, and U by

X = the total number of pumps in use at the two stations

Y = the difference between the number of pumps in use at station 1 and the number
in use at station 2

U = the maximum of the numbers of pumps in use at the two stations

If this experiment is performed and s = (2, 3) results, then X((2, 3)) = 2 + 3 = 5, so
we say that the observed value of X is x = 5. Similarly, the observed value of Y would
be y = 2 − 3 = −1, and the observed value of U would be u = max(2, 3) = 3. ■
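
Because a random variable is simply a function on the sample space, the three rvs of Example 2.3 can be written directly as Python functions of an outcome s, here a pair (pumps in use at station 1, pumps in use at station 2). The helper possible_values and its argument pumps_per_station are assumptions added for illustration; the number of pumps at each station comes from Example 1.3 and is not restated in this section, so the value 6 used below is only a placeholder.

```python
from itertools import product

def X(s):
    """Total number of pumps in use at the two stations."""
    return s[0] + s[1]

def Y(s):
    """Pumps in use at station 1 minus pumps in use at station 2."""
    return s[0] - s[1]

def U(s):
    """Maximum of the numbers of pumps in use at the two stations."""
    return max(s)

def possible_values(rv, pumps_per_station):
    """Sorted set of values of rv, assuming 0..pumps_per_station pumps per station."""
    outcomes = product(range(pumps_per_station + 1), repeat=2)
    return sorted({rv(s) for s in outcomes})

s = (2, 3)
print(X(s), Y(s), U(s))          # prints: 5 -1 3
print(possible_values(U, 6))     # placeholder station size of 6 pumps
```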


Each of the random variables of Examples 2.1—2.3 can assume only a finite 
number of possible values. This need not be the case. 


Example 2.4 Consider an experiment in which 9-V batteries are examined until 
one with an acceptable voltage (S) is obtained. The sample space is 𝒮 = {S, FS, FFS,
... }. Define an rv X by

X = the number of batteries examined before the experiment terminates

Then X(S) = 1, X(FS) = 2, X(FFS) = 3, ..., X(FFFFFFS) = 7, and so on. Any
positive integer is a possible value of X, so the set of possible values is infinite. ■
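
The experiment of Example 2.4 is easy to mimic in software. In the sketch below, an outcome is a string such as "FFS", and the rv X is the length of that string. The probability 0.7 that a battery has acceptable voltage is a made-up value used only to generate outcomes; the example itself assigns no probabilities.

```python
import random

def run_experiment(p_acceptable, rng):
    """Examine batteries until one is acceptable; return the outcome, e.g. 'FFS'."""
    outcome = ""
    while True:
        if rng.random() < p_acceptable:
            return outcome + "S"
        outcome += "F"

def X(outcome):
    """Number of batteries examined before the experiment terminates."""
    return len(outcome)

rng = random.Random(0)
for _ in range(5):
    outcome = run_experiment(0.7, rng)  # 0.7 is a hypothetical success probability
    print(outcome, X(outcome))
```

No upper bound can be placed on the printed X values in advance, which is the sense in which the set of possible values is infinite.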


Example 2.5 Suppose that in some random fashion, a location (latitude and 
longitude) in the continental USA is selected. Define an rv Y by 
Y = the height, in feet, above sea level at the selected location 

For example, if the selected location were (39°50′N, 98°35′W), then we might
have Y((39°50′N, 98°35′W)) = 1748.26 ft. The largest possible value of Y is 14,494
(Mt. Whitney), and the smallest possible value is −282 (Death Valley). The set of
all possible values of Y is the set of all numbers in the interval between −282 and
14,494; that is, the range of Y is


{y : −282 ≤ y ≤ 14,494} = [−282, 14,494]


and there are an infinite number of numbers in this interval. ■


2.1.1 Two Types of Random Variables 


Determining the values of variables such as the number of visits to a Web site 
during a 24-h period or the number of patients in an emergency room at a particular 
time requires only counting. On the other hand, determining values of variables 
such as fuel efficiency of a vehicle (mpg) or reaction time to a stimulus necessitates 
making a measurement of some sort. The following definition formalizes the 
distinction between these two different kinds of variables. 


DEFINITION 

A discrete random variable is an rv whose possible values constitute either a 

finite set or a countably infinite set (e.g., the set of all integers, or the set of all 

positive integers). 

A random variable is continuous if both of the following apply: 

1. Its set of possible values consists either of all numbers in a single interval 
on the number line (possibly infinite in extent, e.g., from −∞ to ∞) or all
numbers in a disjoint union of such intervals (e.g., [0, 10] U [20, 30]). 

2. No possible value of the variable has positive probability, that is, P(X = c) =
0 for any possible value c. 


Although any interval on the number line contains an infinite number of num- 
bers, it can be shown that there is no way to create an infinite listing of all these 
values—there are just too many of them. The second condition describing a 
continuous random variable is perhaps counterintuitive, since it would seem to 
imply a total probability of zero for all possible values. But we shall see in Chap. 3 
that intervals of values have positive probability; the probability of an interval will 
decrease to zero as the width of the interval shrinks to zero. In practice, discrete 
variables virtually always involve counting the number of something, whereas 
continuous variables entail making measurements of some sort. 


Example 2.6 All random variables in Examples 2.1—2.4 are discrete. As another 
example, suppose we select married couples at random and do a blood test on each 
person until we find a husband and wife who both have the same Rh factor. 


With X = the number of blood tests to be performed, possible values of X are {2, 4, 6,
8, ...}. Since the possible values have been listed in sequence, X is a discrete rv. ■


To study basic properties of discrete rvs, only the tools of discrete mathematics— 
summation and differences—are required. The study of continuous variables in 
Chap. 3 will require the continuous mathematics of the calculus—integrals and 
derivatives. 


2.1.2 Exercises: Section 2.1 (1-10)


1. A concrete beam may fail either by shear (S) or flexure (F). Suppose that three
failed beams are randomly selected and the type of failure is determined for 
each one. Let X = the number of beams among the three selected that failed by 
shear. List each outcome in the sample space along with the associated value 
of X. 

2. Give three examples of Bernoulli rvs (other than those in the text). 

3. Using the experiment in Example 2.3, define two more random variables and 
list the possible values of each. 

4. Let X = the number of nonzero digits in a randomly selected zip code. What are 
the possible values of X? Give three possible outcomes and their associated 
X values. 

5. If the sample space 𝒮 is an infinite set, does this necessarily imply that any rv
X defined from 𝒮 will have an infinite set of possible values? If yes, say why. If
no, give an example. 

6. Starting at a fixed time, each car entering an intersection is observed to see 
whether it turns left (L), right (R), or goes straight ahead (A). The experiment 
terminates as soon as a car is observed to turn left. Let X = the number of cars 
observed. What are possible X values? List five outcomes and their associated 
X values. 

7. For each random variable defined here, describe the set of possible values for 
the variable, and state whether the variable is discrete. 

(a) X = the number of unbroken eggs in a randomly chosen standard egg carton 

(b) Y = the number of students on a class list for a particular course who are 
absent on the first day of classes 

(c) U =the number of times a duffer has to swing at a golf ball before hitting it 

(d) X = the length of a randomly selected rattlesnake 

(e) Z= the amount of royalties earned from the sale of a first edition of 10,000 
textbooks 

(f) Y = the acidity level (pH) of a randomly chosen soil sample 

(g) X = the tension (psi) at which a randomly selected tennis racket has been 
strung 

(h) X = the total number of coin tosses required for three individuals to obtain 
a match (HHH or TTT) 

8. Each time a component is tested, the trial is a success (S) or failure (F). 
Suppose the component is tested repeatedly until a success occurs on three 
consecutive trials. Let Y denote the number of trials necessary to achieve this.
List all outcomes corresponding to the five smallest possible values of Y, and 
state which Y value is associated with each one. 

9. An individual named Claudius is located at the point 0 in the accompanying 
diagram. 


Using an appropriate randomization device (such as a tetrahedral die, one 
having four sides), Claudius first moves to one of the four locations B1, B2, B3,
B4. Once at one of these locations, he uses another randomization device to
decide whether he next returns to 0 or next visits one of the other two adjacent
points. This process then continues; after each move, another move to one of
the (new) adjacent points is determined by tossing an appropriate die or coin.

(a) Let X = the number of moves that Claudius makes before first returning to
0. What are possible values of X? Is X discrete or continuous?

(b) If moves are allowed also along the diagonal paths connecting 0 to A1, A2,
A3, and A4, respectively, answer the questions in part (a).

10. The number of pumps in use at both a six-pump station and a four-pump station 
will be determined. Give the possible values for each of the following random 
variables: 

(a) T = the total number of pumps in use 

(b) X = the difference between the numbers in use at stations 1 and 2 
(c) U = the maximum number of pumps in use at either station 

(d) Z = the number of stations having exactly two pumps in use 


2.2 Probability Distributions for Discrete Random Variables 


When probabilities are assigned to various outcomes in 𝒮, these in turn determine 
probabilities associated with the values of any particular rv X. The probability 
distribution of X says how the total probability of 1 is distributed among (allocated 
to) the various possible X values. 


Example 2.7 Six batches of components are ready to be shipped by a supplier. The 
number of defective components in each batch is as follows: 


Batch                 | #1  #2  #3  #4  #5  #6 
Number of defectives  | 0   2   0   1   2   0 
One of these lots is to be randomly selected for shipment to a customer. Let X be the 
number of defectives in the selected lot. The three possible X values are 0, 1, and 2. Of 
the six equally likely simple events, three result in X = 0, one in X = 1, and the other 
two in X = 2. Let p(0) denote the probability that X = 0 and p(1) and p(2) represent the 
probabilities of the other two possible values of X. Then 


$$
\begin{aligned}
p(0) &= P(X = 0) = P(\text{lot 1 or 3 or 6 is sent}) = \frac{3}{6} = .500 \\
p(1) &= P(X = 1) = P(\text{lot 4 is sent}) = \frac{1}{6} = .167 \\
p(2) &= P(X = 2) = P(\text{lot 2 or 5 is sent}) = \frac{2}{6} = .333
\end{aligned}
$$


That is, a probability of .500 is distributed to the X value 0, a probability of .167 
is placed on the X value 1, and the remaining probability, .333, is associated with 
the X value 2. The values of X along with their probabilities collectively specify the 
probability distribution or probability mass function of X. If this experiment were 
repeated over and over again, in the long run X = 0 would occur one-half of the 
time, X = 1 one-sixth of the time, and X = 2 one-third of the time. 
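
The counting argument in Example 2.7 can be checked mechanically. The following is a minimal Python sketch, assuming each of the six lots is equally likely to be selected; the function name `pmf_from_equally_likely` is illustrative and does not come from the text.

```python
from collections import Counter
from fractions import Fraction

# Number of defectives in lots #1 through #6, as given in Example 2.7.
defectives = [0, 2, 0, 1, 2, 0]


def pmf_from_equally_likely(values):
    """Return {x: P(X = x)} when every listed outcome is equally likely."""
    counts = Counter(values)
    n = len(values)
    return {x: Fraction(c, n) for x, c in sorted(counts.items())}


pmf = pmf_from_equally_likely(defectives)
for x, prob in pmf.items():
    print(f"p({x}) = {prob} = {float(prob):.3f}")
```

Exact fractions are used so that the probabilities sum to exactly 1, which is one of the two conditions every pmf must satisfy.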


DEFINITION 
The probability distribution or probability mass function (pmf) of a 
discrete rv is defined for every number x by 

$$
p(x) = P(X = x) = P(\text{all } s \in \mathcal{S} : X(s) = x).^{1}
$$


In words, for every possible value x of the random variable, the pmf specifies the 
probability of observing that value when the experiment is performed. The 
conditions p(x) ≥ 0 and Σ p(x) = 1, where the summation is over all possible x, 
are required of any pmf. 


Example 2.8 Consider randomly selecting a student at a large public university, 
and define a Bernoulli rv by X = 1 if the selected student does not qualify for 
in-state tuition (a success from the university administration’s point of view) and 
X = 0 if the student does qualify. If 20% of all students do not qualify, the pmf for 
X is 

p(0) = P(X = 0) = P(the selected student does qualify) = .8 

p(1) = P(X = 1) = P(the selected student does not qualify) = .2 

p(x) = P(X = x) = 0 for x ≠ 0 or 1. 


¹ P(X = x) is read “the probability that the rv X assumes the value x.” For example, P(X = 2) 
denotes the probability that the resulting X value is 2. 

$$
p(x) = \begin{cases} .8 & \text{if } x = 0 \\ .2 & \text{if } x = 1 \\ 0 & \text{if } x \neq 0 \text{ or } 1 \end{cases}
$$


Figure 2.2 is a picture of this pmf, called a line graph. 


Fig. 2.2 The line graph for the pmf in Example 2.8 


Example 2.9 Consider a group of five potential blood donors—A, B, C, D, and E—of 
whom only A and B have type O+ blood. Five blood samples, one from each individual, 
will be typed in random order until an O+ individual is identified. Let the rv Y = the 
number of typings necessary to identify an O+ individual. Then the pmf of Y is 


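$$
\begin{aligned}
p(1) &= P(Y = 1) = P(\text{A or B typed first}) = \frac{2}{5} = .4 \\
p(2) &= P(Y = 2) = P(\text{C, D, or E first, and then A or B}) \\
     &= P(\text{C, D, or E first}) \cdot P(\text{A or B next} \mid \text{C, D, or E first}) = \frac{3}{5} \cdot \frac{2}{4} = .3 \\
p(3) &= P(Y = 3) = P(\text{C, D, or E first and second, and then A or B}) = \frac{3}{5} \cdot \frac{2}{4} \cdot \frac{2}{3} = .2 \\
p(4) &= P(Y = 4) = P(\text{C, D, and E all done first}) = \frac{3}{5} \cdot \frac{2}{4} \cdot \frac{1}{3} = .1 \\
p(y) &= 0 \quad \text{for } y \neq 1, 2, 3, 4.
\end{aligned}
$$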


The pmf can be presented compactly in tabular form: 


y     | 1   2   3   4 
p(y)  | .4  .3  .2  .1 


where any y value not listed receives zero probability. Figure 2.3 shows the line 
graph for this pmf. 
Fig. 2.3 The line graph for the pmf in Example 2.9 
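
The conditional-probability calculation in Example 2.9 can be cross-checked by brute force. The sketch below assumes that all 5! = 120 typing orders are equally likely (the meaning of "random order" here) and simply counts, for each order, how many typings are needed; the helper name `typings_needed` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

donors = "ABCDE"
o_positive = {"A", "B"}  # only A and B have type O+ blood


def typings_needed(order):
    """Number of typings until the first O+ donor appears in this order."""
    for position, donor in enumerate(order, start=1):
        if donor in o_positive:
            return position
    raise ValueError("no O+ donor in the ordering")


orders = list(permutations(donors))  # 5! = 120 equally likely orderings
counts = Counter(typings_needed(order) for order in orders)
pmf_y = {y: Fraction(c, len(orders)) for y, c in sorted(counts.items())}

for y, prob in pmf_y.items():
    print(f"p({y}) = {prob}")
```

Enumeration scales poorly with the number of donors, so it is a check on the multiplication-rule argument rather than a replacement for it.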


The name “probability mass function” is suggested by a model used in physics 
for a system of “point masses.” In this model, masses are distributed at various 
locations x along a one-dimensional axis. Our pmf describes how the total proba- 
bility mass of 1 is distributed at various points along the axis of possible values of 
the random variable (where and how much mass at each x). 

Another useful pictorial representation of a pmf is called a probability histo- 
gram. Above each y with p(y) > 0, construct a rectangle centered at y. The height 
of each rectangle is proportional to p(y), and the base is the same for all rectangles. 
When possible values are equally spaced, the base is frequently chosen as the 
distance between successive y values (though it could be smaller). Figure 2.4 
shows two probability histograms. 


Fig. 2.4 Probability histograms: (a) Example 2.8; (b) Example 2.9 


2.2.1 A Parameter of a Probability Distribution 


In Example 2.8, we had p(0) = .8 and p(1) = .2. At another university, it may be the 
case that p(0) = .9 and p(1) = .1. More generally, the pmf of any Bernoulli rv can be 
expressed in the form p(1) = α and p(0) = 1 − α, where 0 < α < 1. Because the pmf 
depends on the particular value of α, we often write p(x; α) rather than just p(x): 

$$
p(x; \alpha) = \begin{cases} 1 - \alpha & \text{if } x = 0 \\ \alpha & \text{if } x = 1 \\ 0 & \text{otherwise} \end{cases} \tag{2.1}
$$

Then each choice of α in Expression (2.1) yields a different pmf. 


DEFINITION 
Suppose p(x) depends on a quantity that can be assigned any one of a number 
of possible values, with each different value determining a different probability 
distribution. Such a quantity is called a parameter of the distribution. 
The collection of all probability distributions for different values of the 
parameter is called a family of probability distributions. 


The quantity α in Expression (2.1) is a parameter. Each different number α 
between 0 and 1 determines a different member of a family of distributions; two 
such members are 

$$
p(x; .6) = \begin{cases} .4 & \text{if } x = 0 \\ .6 & \text{if } x = 1 \\ 0 & \text{otherwise} \end{cases}
\qquad \text{and} \qquad
p(x; .5) = \begin{cases} .5 & \text{if } x = 0 \\ .5 & \text{if } x = 1 \\ 0 & \text{otherwise} \end{cases}
$$

Every probability distribution for a Bernoulli rv has the form of Expression (2.1), 
so it is called the family of Bernoulli distributions. 


Example 2.10 Starting at a fixed time, we observe the gender of each newborn 
child at a certain hospital until a boy (B) is born. Let p = P(B), assume that 
successive births are independent, and define the rv X by X = number of births 
observed. Then 

$$
\begin{aligned}
p(1) &= P(X = 1) = P(B) = p \\
p(2) &= P(X = 2) = P(GB) = P(G) \cdot P(B) = (1 - p)p
\end{aligned}
$$

and 

$$
p(3) = P(X = 3) = P(GGB) = P(G) \cdot P(G) \cdot P(B) = (1 - p)^2 p
$$

Continuing in this way, a general formula emerges: 

$$
p(x) = \begin{cases} (1 - p)^{x - 1} p & x = 1, 2, 3, \ldots \\ 0 & \text{otherwise} \end{cases} \tag{2.2}
$$

The quantity p in Expression (2.2) represents a number between 0 and 1 and is a 
parameter of the probability distribution. In the gender example, p = .51 might be 
appropriate, but if we were looking for the first child with Rh-positive blood, then 
we might have p = .85. The random variable X has what is known as a geometric 
distribution, which we will discuss in Sect. 2.6. 
2.2.2 The Cumulative Distribution Function 


For some fixed value x, we often wish to compute the probability that the observed 
value of X will be at most x. For example, let X be the number of beds occupied in a 
hospital’s emergency room at a certain time of day, and suppose the pmf of X is 
given by 


x     | 0    1    2    3    4 
p(x)  | .20  .25  .30  .15  .10 


Then the probability that at most two beds are occupied is P(X ≤ 2) = p(0) + 
p(1) + p(2) = .75. Furthermore, since X ≤ 2.7 iff X ≤ 2, we also have P(X ≤ 2.7) = .75, 
and similarly P(X ≤ 2.999) = .75. Since 0 is the smallest possible X value, 
P(X ≤ −1.5) = 0, P(X ≤ −10) = 0, and in fact for any negative number x, 
P(X ≤ x) = 0. And because 4 is the largest possible value of X, P(X ≤ 4) = 1, 
P(X ≤ 9.8) = 1, and so on. 

Very importantly, P(X < 2) = p(0) + p(1) = .45 < .75 = P(X ≤ 2), because the 
latter probability includes the probability mass at the x value 2 whereas the former 
probability does not. More generally, P(X < x) < P(X ≤ x) whenever x is a possible 
value of X. Furthermore, P(X ≤ x) is a well-defined and computable probability for 
any number x. 


DEFINITION 
The cumulative distribution function (cdf) F(x) of a discrete rv X with pmf 
p(x) is defined for every number x by 


$$
F(x) = P(X \le x) = \sum_{y:\, y \le x} p(y) \tag{2.3}
$$


For any number x, F(x) is the probability that the observed value of X will 
be at most x. 


Example 2.11 A store carries flash drives with 1, 2, 4, 8, or 16 GB of memory. The 
accompanying table gives the distribution of X = the amount of memory in a 
purchased drive: 


x     | 1    2    4    8    16 
p(x)  | .05  .10  .35  .40  .10 


Let’s first determine F(x) for each of the five possible values of X: 




F(1) = P(X ≤ 1) = P(X = 1) = p(1) = .05 

F(2) = P(X ≤ 2) = P(X = 1 or 2) = p(1) + p(2) = .15 

F(4) = P(X ≤ 4) = P(X = 1 or 2 or 4) = p(1) + p(2) + p(4) = .50 

F(8) = P(X ≤ 8) = p(1) + p(2) + p(4) + p(8) = .90 

F(16) = P(X ≤ 16) = 1 


Now for any other number x, F(x) will equal the value of F at the closest possible 
value of X to the left of x. For example, 


F(2.7) = P(X ≤ 2.7) = P(X ≤ 2) = F(2) = .15 
F(7.999) = P(X ≤ 7.999) = P(X ≤ 4) = F(4) = .50 


If x is less than 1, F(x) = 0 [e.g., F(.58) = 0], and if x is at least 16, F(x) = 1 [e.g., 
F(25) = 1]. The cdf is thus 


$$
F(x) = \begin{cases}
0 & x < 1 \\
.05 & 1 \le x < 2 \\
.15 & 2 \le x < 4 \\
.50 & 4 \le x < 8 \\
.90 & 8 \le x < 16 \\
1 & 16 \le x
\end{cases}
$$


A graph of this cdf is shown in Fig. 2.5. 


Fig. 2.5 A graph of the cdf of Example 2.11 


For X a discrete rv, the graph of F(x) will have a jump at every possible value of 
X and will be flat between possible values. Such a graph is called a step function. 
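
The rule "F(x) equals the value of F at the closest possible value of X to the left of x" translates directly into a sorted lookup. Here is a minimal Python sketch for the flash-drive distribution of Example 2.11; the use of `bisect_right` is one convenient way to locate that value and is an implementation choice, not something prescribed by the text.

```python
from bisect import bisect_right
from itertools import accumulate

# Possible values of X (GB of memory) and their probabilities, Example 2.11.
values = [1, 2, 4, 8, 16]
probs = [0.05, 0.10, 0.35, 0.40, 0.10]

# Running totals: cumulative[i] = P(X <= values[i]).
cumulative = list(accumulate(probs))


def cdf(x):
    """F(x) = P(X <= x) for any real number x (a step function)."""
    k = bisect_right(values, x)  # how many possible values are <= x
    return 0.0 if k == 0 else cumulative[k - 1]


for x in (0.58, 1, 2.7, 7.999, 8, 25):
    print(f"F({x}) = {cdf(x):.2f}")
```

Because the function is flat between possible values, `cdf(2.7)` and `cdf(2)` return the same number, and the jump at each possible value equals the pmf there.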
Example 2.12 In Example 2.10, any positive integer was a possible X value, and 
the pmf was 

$$
p(x) = \begin{cases} (1 - p)^{x - 1} p & x = 1, 2, 3, \ldots \\ 0 & \text{otherwise} \end{cases}
$$


For any positive integer x, 


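$$
F(x) = \sum_{y \le x} p(y) = \sum_{y=1}^{x} (1 - p)^{y-1} p = p \sum_{y=0}^{x-1} (1 - p)^{y} \tag{2.4}
$$

To evaluate this sum, we use the fact that the partial sum of a geometric series is 

$$
\sum_{y=0}^{k} a^{y} = \frac{1 - a^{k+1}}{1 - a}
$$

Using this in Eq. (2.4), with a = 1 − p and k = x − 1, gives 

$$
F(x) = p \cdot \frac{1 - (1 - p)^{x}}{1 - (1 - p)} = 1 - (1 - p)^{x} \qquad x \text{ a positive integer}
$$

Since F is constant in between positive integers, 

$$
F(x) = \begin{cases} 0 & x < 1 \\ 1 - (1 - p)^{[x]} & x \ge 1 \end{cases} \tag{2.5}
$$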


where [x] is the largest integer ≤ x (e.g., [2.7] = 2). Thus if p = .51 as in the birth 
example, then the probability of having to examine at most five births to see the first 
boy is F(5) = 1 − (.49)⁵ = 1 − .0282 = .9718, whereas F(10) = 1 − (.49)¹⁰ ≈ .9992. This cdf is 
graphed in Fig. 2.6. 


Fig. 2.6 A graph of F(x) for Example 2.12 
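
As a numerical check on Example 2.12, the closed form 1 − (1 − p)^[x] can be compared with direct summation of the pmf. This is a sketch only; the function names are illustrative, and p = .51 is the value suggested for the birth example.

```python
import math


def geometric_pmf(x, p):
    """p(x) = (1 - p)**(x - 1) * p for x = 1, 2, 3, ...; 0 otherwise."""
    if x >= 1 and x == int(x):
        return (1 - p) ** (int(x) - 1) * p
    return 0.0


def geometric_cdf_by_sum(x, p):
    """F(x) obtained by adding p(y) over all positive integers y <= x."""
    return sum(geometric_pmf(y, p) for y in range(1, math.floor(x) + 1))


def geometric_cdf_closed(x, p):
    """F(x) = 1 - (1 - p)**floor(x) for x >= 1, and 0 for x < 1."""
    if x < 1:
        return 0.0
    return 1 - (1 - p) ** math.floor(x)


p = 0.51
for x in (0.5, 1, 2.7, 5, 10):
    print(x, round(geometric_cdf_by_sum(x, p), 4), round(geometric_cdf_closed(x, p), 4))
```

The two columns agree up to floating-point rounding, and the value at x = 5 reproduces the .9718 computed in the example.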
In our examples thus far, the cdf has been derived from the pmf. This process can 
be reversed to obtain the pmf from the cdf whenever the latter function is available. 
Suppose, for example, that X represents the number of defective components in a 
shipment consisting of six components, so that possible X values are 0, 1, ..., 6. 
Then 

$$
\begin{aligned}
p(3) &= P(X = 3) \\
&= [p(0) + p(1) + p(2) + p(3)] - [p(0) + p(1) + p(2)] \\
&= P(X \le 3) - P(X \le 2) \\
&= F(3) - F(2)
\end{aligned}
$$


More generally, the probability that X falls in a specified interval is easily 
obtained from the cdf. For example, 

$$
\begin{aligned}
P(2 \le X \le 4) &= p(2) + p(3) + p(4) \\
&= [p(0) + \cdots + p(4)] - [p(0) + p(1)] \\
&= P(X \le 4) - P(X \le 1) = F(4) - F(1)
\end{aligned}
$$

Notice that P(2 ≤ X ≤ 4) ≠ F(4) − F(2). This is because the X value 2 is 
included in 2 ≤ X ≤ 4, so we do not want to subtract out its probability. However, 
P(2 < X ≤ 4) = F(4) − F(2) because X = 2 is not included in the interval 2 < X ≤ 4. 


PROPOSITION 
For any two numbers a and b with a ≤ b, 

$$
P(a \le X \le b) = F(b) - F(a-)
$$

where “a−” represents the largest possible X value that is strictly less than a. 
In particular, if the only possible values are integers and if a and b are 
integers, then 

$$
\begin{aligned}
P(a \le X \le b) &= P(X = a \text{ or } a + 1 \text{ or } \ldots \text{ or } b) \\
&= F(b) - F(a - 1)
\end{aligned}
$$

Taking a = b yields P(X = a) = F(a) − F(a − 1) in this case. 


The reason for subtracting F(a−) rather than F(a) is that we want to include 
P(X = a); F(b) − F(a) gives P(a < X ≤ b). This proposition will be used 
extensively when computing binomial and Poisson probabilities in Sects. 2.4 
and 2.5. 
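
The distinction between F(b) − F(a − 1) and F(b) − F(a) is easy to get wrong, so a small numerical illustration may help. The sketch below reuses the emergency-room pmf from the start of this subsection (an integer-valued X); exact fractions are used so the comparisons are not clouded by rounding, and the function names are illustrative.

```python
from fractions import Fraction

# pmf of X = number of occupied beds, from the emergency-room example above.
pmf = {
    0: Fraction(20, 100),
    1: Fraction(25, 100),
    2: Fraction(30, 100),
    3: Fraction(15, 100),
    4: Fraction(10, 100),
}


def F(x):
    """cdf: F(x) = P(X <= x)."""
    return sum(p for value, p in pmf.items() if value <= x)


def prob_closed_interval(a, b):
    """P(a <= X <= b) = F(b) - F(a - 1), valid for integer-valued X and integers a <= b."""
    return F(b) - F(a - 1)


def prob_half_open_interval(a, b):
    """P(a < X <= b) = F(b) - F(a)."""
    return F(b) - F(a)


print(prob_closed_interval(2, 4))     # includes the mass at X = 2
print(prob_half_open_interval(2, 4))  # excludes the mass at X = 2
print(prob_closed_interval(3, 3))     # a = b recovers p(3)
```

The first two results differ by exactly p(2), which is the probability mass that F(4) − F(2) subtracts away.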
Example 2.13 Let X = the number of days of sick leave taken by a randomly 
selected employee of a large company during a particular year. If the maximum 
number of allowable sick days per year is 14, possible values of X are 0, 1, ..., 14. 
With F(0) = .58, F(1) = .72, F(2) = .76, F(3) = .81, F(4) = .88, and F(5) = .94, 


P(2 ≤ X ≤ 5) = P(X = 2, 3, 4, or 5) = F(5) − F(1) = .22 


and 


2.2.3. Another View of Probability Mass Functions 


It is often helpful to think of a pmf as specifying a mathematical model for a 
discrete population. 


Example 2.14 Consider selecting at random a household in a certain region, and 
let X = the number of individuals in the selected household. Suppose the pmf of 
X is as follows: 


This is very close to the household size distribution for rural Thailand given in 
the article “The Probability of Containment for Multitype Branching Process 
Models for Emerging Epidemics” (J. of Applied Probability, 2011: 173-188), 
which modeled influenza transmission. 

Suppose this is based on one million households. One way to view this situation is to 
think of the population as consisting of 1,000,000 households, each one having its own 
X value; the proportion with each X value is given by p(x) in the above table. An 
alternative viewpoint is to forget about the households and think of the population itself 
as consisting of X values—14% of these values are 1, 17.5% are 2, and so on. The pmf 
then describes the distribution of the possible population values 1, 2, ..., 10. 


Once we have such a population model, we will use it to compute values of 
various population characteristics such as the mean, which describes the center of 
the population distribution, and the standard deviation, which describes the extent 
of spread about the center. Both of these are developed in the next section. 
2.2.4 Exercises: Section 2.2 (11-28) 


11. Let X be the number of students who show up at a professor’s office hours on a 
particular day. Suppose that the only possible values of X are 0, 1, 2, 3, and 
4, and that p(0) = .30, p(1) = .25, p(2) = .20, and p(3) = .15. 

(a) What is p(4)? 

(b) Draw both a line graph and a probability histogram for the pmf of X. 

(c) What is the probability that at least two students come to the office hour? 
What is the probability that more than two students come to the office 
hour? 

(d) What is the probability that the professor shows up for his office hour? 

12. Airlines sometimes overbook flights. Suppose that for a plane with 50 seats, 
55 passengers have tickets. Define the random variable Y as the number of 
ticketed passengers who actually show up for the flight. The probability mass 
function of Y appears in the accompanying table. 


y | 45 46 47 48 49 50 51 52 53 54 55 


(a) What is the probability that the flight will accommodate all ticketed 
passengers who show up? 

(b) What is the probability that not all ticketed passengers who show up can 
be accommodated? 

(c) If you are the first person on the standby list (which means you will be the 
first one to get on the plane if there are any seats available after all ticketed 
passengers have been accommodated), what is the probability that you 
will be able to take the flight? What is this probability if you are the third 
person on the standby list? 

13. A mail-order computer business has six telephone lines. Let X denote the 
number of lines in use at a specified time. Suppose the pmf of X is as given in 
the accompanying table. 


x     | 0    1    2    3    4    5    6 
p(x)  | .10  .15  .20  .25  .20  .06  .04 

Calculate the probability of each of the following events. 
(a) {at most three lines are in use} 
(b) {fewer than three lines are in use} 
(c) {at least three lines are in use} 
(d) {between two and five lines, inclusive, are in use} 
(e) {between two and four lines, inclusive, are not in use} 
(f) {at least four lines are not in use} 
14. A contractor is required by a county planning department to submit one, two, 
three, four, or five forms (depending on the nature of the project) in applying 
for a building permit. Let Y = the number of forms required of the next 
applicant. The probability that y forms are required is known to be proportional 
to y; that is, p(y) = ky for y = 1, ..., 5. 

(a) What is the value of k? [Hint: Σ p(y) = 1, where the sum runs over y = 1, ..., 5.] 

(b) What is the probability that at most three forms are required? 

(c) What is the probability that between two and four forms (inclusive) are 
required? 

(d) Could p(y) = y²/50 for y = 1, ..., 5 be the pmf of Y? 

15. Many manufacturers have quality control programs that include inspection of 
incoming materials for defects. Suppose a computer manufacturer receives 
computer boards in lots of five. Two boards are selected from each lot for 
inspection. We can represent possible outcomes of the selection process by 
pairs. For example, the pair (1, 2) represents the selection of boards 1 and 2 for 
inspection. 

(a) List the ten different possible outcomes. 

(b) Suppose that boards 1 and 2 are the only defective boards in a lot of five. 
Two boards are to be chosen at random. Define X to be the number of 
defective boards observed among those inspected. Find the probability 
distribution of X. 

(c) Let F(x) denote the cdf of X. First determine F(0) = P(X ≤ 0), F(1), and 
F(2), and then obtain F(x) for all other x. 

16. Some parts of California are particularly earthquake-prone. Suppose that in 
one such area, 25% of all homeowners are insured against earthquake damage. 
Four homeowners are to be selected at random; let X denote the number 
among the four who have earthquake insurance. 

(a) Find the probability distribution of X. [Hint: Let S denote a homeowner 
who has insurance and F one who does not. Then one possible outcome is 
SFSS, with probability (.25)(.75)(.25)(.25) and associated X value 3. 
There are 15 other outcomes.] 

(b) Draw the corresponding probability histogram. 

(c) What is the most likely value for X? 

(d) What is the probability that at least two of the four selected have earth- 
quake insurance? 

17. A new battery’s voltage may be acceptable (A) or unacceptable (U). A certain 
flashlight requires two batteries, so batteries will be independently selected 
and tested until two acceptable ones have been found. Suppose that 90% of all 
batteries have acceptable voltages. Let Y denote the number of batteries that 
must be tested. 

(a) What is p(2), that is, P(Y = 2)? 

(b) What is p(3)? [Hint: There are two different outcomes that result in Y = 3.] 

(c) To have Y= 5, what must be true of the fifth battery selected? List the four 
outcomes for which Y = 5 and then determine p(5). 

(d) Use the pattern in your answers for parts (a)-(c) to obtain a general 
formula for p(y). 

18. Two fair six-sided dice are tossed independently. Let M = the maximum of the 
two tosses, so M(1, 5) = 5, M(3, 3) = 3, etc. 
(a) What is the pmf of M? [Hint: First determine p(1), then p(2), and so on.] 
(b) Determine the cdf of M and graph it. 

19. A library subscribes to two different weekly news magazines, each of which is 
supposed to arrive in Wednesday’s mail. In actuality, each one may arrive on 
Wednesday, Thursday, Friday, or Saturday. Suppose the two arrive independently 
of one another, and for each one P(W) = .3, P(Th) = .4, P(F) = .2, and 
P(S) = .1. Let Y = the number of days beyond Wednesday that it takes for both 
magazines to arrive (so possible Y values are 0, 1, 2, or 3). Compute the pmf of 
Y. [Hint: There are 16 possible outcomes; Y(W, W) = 0, Y(F, Th) = 2, and so 
on.] 

20. Three couples and two single individuals have been invited to an investment 
seminar and have agreed to attend. Suppose the probability that any particular 
couple or individual arrives late is .4 (a couple will travel together in the same 
vehicle, so either both people will be on time or else both will arrive late). 
Assume that different couples and individuals are on time or late independently 
of one another. Let X = the number of people who arrive late for the seminar. 
(a) Determine the probability mass function of X. [Hint: label the three couples 
#1, #2, and #3 and the two individuals #4 and #5.] 
(b) Obtain the cumulative distribution function of X, and use it to calculate 
P(2 ≤ X ≤ 6). 

21. As described in the book’s Introduction, Benford’s Law arises in a variety of 
situations as a model for the first digit of a number: 

$$
p(x) = P(\text{1st digit is } x) = \log_{10}\left(\frac{x + 1}{x}\right), \qquad x = 1, 2, \ldots, 9
$$

(a) Without computing individual probabilities from this formula, show that it 
specifies a legitimate pmf. 
(b) Now compute the individual probabilities and compare to the distribution 
where 1, 2,..., 9 are equally likely. 
(c) Obtain the cdf of X, a rv following Benford’s law. 
(d) Using the cdf, what is the probability that the leading digit is at most 3? At 
least 5? 

22. Refer to Exercise 13, and calculate and graph the cdf F(x). Then use it to 
calculate the probabilities of the events given in parts (a)-(d) of that problem. 

23. Let X denote the number of vehicles queued up at a bank’s drive-up window at 
a particular time of day. The cdf of X is as follows: 

$$
F(x) = \begin{cases}
0 & x < 0 \\
.06 & 0 \le x < 1 \\
.19 & 1 \le x < 2 \\
.39 & 2 \le x < 3 \\
.67 & 3 \le x < 4 \\
.92 & 4 \le x < 5 \\
.97 & 5 \le x < 6 \\
1 & 6 \le x
\end{cases}
$$

Calculate the following probabilities directly from the cdf: 
(a) p(2), that is, P(X = 2) 
(b) P(X > 3) 
(c) P(2 ≤ X ≤ 5) 
(d) P(2 < X < 5) 

24. An insurance company offers its policyholders a number of different premium 
payment options. For a randomly selected policyholder, let X = the number of 
months between successive payments. The cdf of X is as follows: 

$$
F(x) = \begin{cases}
0 & x < 1 \\
.30 & 1 \le x < 3 \\
.40 & 3 \le x < 4 \\
.45 & 4 \le x < 6 \\
.60 & 6 \le x < 12 \\
1 & 12 \le x
\end{cases}
$$

(a) What is the pmf of X? 

(b) Using just the cdf, compute P(3 ≤ X ≤ 6) and P(4 ≤ X). 

25. In Example 2.10, let Y = the number of girls born before the experiment 
terminates. With p = P(B) and 1 — p = P(G), what is the pmf of Y? [Hint: 
First list the possible values of Y, starting with the smallest, and proceed until 
you see a general formula.] 

26. Alvie Singer lives at 0 in the accompanying diagram and has four friends who 
live at A, B, C, and D. One day Alvie decides to go visiting, so he tosses a fair 
coin twice to decide which of the four to visit. Once at a friend’s house, he will 
either return home or else proceed to one of the two adjacent houses (such as 
0, A, or C when at B), with each of the three possibilities having probability 1/3. 
In this way, Alvie continues to visit friends until he returns home. 


(a) Let X = the number of times that Alvie visits a friend. Derive the pmf of X. 

(b) Let Y = the number of straight-line segments that Alvie traverses (includ- 
ing those leading to and from 0). What is the pmf of Y? 

(c) Suppose that female friends live at A and C and male friends at B and D. If 
Z = the number of visits to female friends, what is the pmf of Z? 

27. After all students have left the classroom, a statistics professor notices that four 
copies of the text were left under desks. At the beginning of the next lecture, the 
professor distributes the four books in a completely random fashion to each of the 
four students (1, 2, 3, and 4) who claim to have left books. One possible outcome 
is that 1 receives 2’s book, 2 receives 4’s book, 3 receives his or her own book, 
and 4 receives 1’s book. This outcome can be abbreviated as (2, 4, 3, 1). 

(a) List the other 23 possible outcomes. 
(b) Let X denote the number of students who receive their own book. Deter- 
mine the pmf of X. 

28. Show that the cdf F(x) is a nondecreasing function; that is, x₁ < x₂ implies that 
F(x₁) ≤ F(x₂). Under what condition will F(x₁) = F(x₂)? 


2.3 Expected Value and Standard Deviation 


Consider a university with 15,000 students and let X = the number of courses 
for which a randomly selected student is registered. The pmf of X follows. Since 
p(1) = .01, we know that (.01) · (15,000) = 150 of the students are registered for 
one course, and similarly for the other x values. 


x                  | 1    2    3     4     5     6     7 
p(x)               | .01  .03  .13   .25   .39   .17   .02        (2.6) 
Number registered  | 150  450  1950  3750  5850  2550  300 


To compute the average number of courses per student, i.e., the average value of 
X in the population, we should calculate the total number of courses and divide by 
the total number of students. Since each of 150 students is taking one course, these 
150 contribute 150 courses to the total. Similarly, 450 students contribute 2(450) 
courses, and so on. The population average value of X is then 


$$
\frac{1(150) + 2(450) + 3(1950) + \cdots + 7(300)}{15{,}000} = 4.57 \tag{2.7}
$$


Since 150/15,000 = .01 = p(1), 450/15,000 = .03 = p(2), and so on, an 
alternative expression for Eq. (2.7) is 


$$
1 \cdot p(1) + 2 \cdot p(2) + \cdots + 7 \cdot p(7) \tag{2.8}
$$


Expression (2.8) shows that to compute the population average value of X, 
we need only the possible values of X along with their probabilities (proportions). 
In particular, the population size is irrelevant as long as the pmf is given by (2.6). 
The average or mean value of X is then a weighted average of the possible values 
1, ..., 7, where the weights are the probabilities of those values. 


2.3.1 The Expected Value of X 


DEFINITION 
Let X be a discrete rv with set of possible values D and pmf p(x). The 
expected value or mean value of X, denoted by E(X) or μ_X or just μ, is 

$$
E(X) = \mu_X = \mu = \sum_{x \in D} x \cdot p(x)
$$


Example 2.15 For the pmf of X = number of courses in (2.6), 


$$
\begin{aligned}
\mu &= 1 \cdot p(1) + 2 \cdot p(2) + \cdots + 7 \cdot p(7) \\
&= (1)(.01) + (2)(.03) + \cdots + (7)(.02) \\
&= .01 + .06 + .39 + 1.00 + 1.95 + 1.02 + .14 = 4.57
\end{aligned}
$$

If we think of the population as consisting of the X values 1, 2, ..., 7, then μ = 4.57 
is the population mean (we will often refer to μ as the population mean rather than 
the mean of X in the population). Notice that μ here is not 4, the ordinary average of 
1, ..., 7, because the distribution puts more weight on 4, 5, and 6 than on other 
X values. 
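
The equivalence of Eq. (2.7) and Expression (2.8) can be confirmed numerically. The following Python sketch computes the mean both ways from the figures in (2.6); the function names are illustrative.

```python
# pmf and head counts from display (2.6).
pmf = {1: 0.01, 2: 0.03, 3: 0.13, 4: 0.25, 5: 0.39, 6: 0.17, 7: 0.02}
registered = {1: 150, 2: 450, 3: 1950, 4: 3750, 5: 5850, 6: 2550, 7: 300}


def expected_value(pmf):
    """E(X) as a probability-weighted average of the possible values."""
    return sum(x * p for x, p in pmf.items())


def population_average(counts):
    """Total number of courses divided by total number of students, as in Eq. (2.7)."""
    total_courses = sum(x * n for x, n in counts.items())
    total_students = sum(counts.values())
    return total_courses / total_students


mu = expected_value(pmf)
avg = population_average(registered)
print(round(mu, 2), round(avg, 2))
```

Only the proportions enter `expected_value`, which is why the population size of 15,000 plays no role once the pmf is known.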


In Example 2.15, the expected value μ was 4.57, which is not a possible value of X. 
The word expected should be interpreted with caution because one would not 
expect to see an X value of 4.57 when a single student is selected. 


Example 2.16 Just after birth, each newborn child is rated on a scale called the 
Apgar scale. The possible ratings are 0, 1, ..., 10, with the child’s rating determined 
by color, muscle tone, respiratory effort, heartbeat, and reflex irritability (the best 
possible score is 10). Let X be the Apgar score of a randomly selected child born at a 
certain hospital during the next year, and suppose that the pmf of X is 


x     | 0     1     2     3     4    5    6    7    8    9    10 
p(x)  | .002  .001  .002  .005  .02  .04  .18  .37  .25  .12  .01 


Then the mean value of X is 

$$
\mu = (0)(.002) + (1)(.001) + (2)(.002) + \cdots + (8)(.25) + (9)(.12) + (10)(.01) = 7.15
$$

(Again, μ is not a possible value of the variable X.) If the stated model is correct, 
then the mean Apgar score for the population of all children born at this hospital 
next year will be 7.15. 


Example 2.17 Let X = 1 if a randomly selected component needs warranty service 
and = 0 otherwise. If the chance a component needs warranty service is p, then X is 
a Bernoulli rv with pmf p(1) = p and p(0) = 1 − p, from which 

$$
E(X) = 0 \cdot p(0) + 1 \cdot p(1) = 0(1 - p) + 1(p) = p
$$

That is, the expected value of X is just the probability that X takes on the value 
1. If we conceptualize a population consisting of 0s in proportion 1 − p and 1s in 
proportion p, then the population average is μ = p. 


There is another frequently used interpretation of μ. Consider observing a first 
value x₁ of X, then a second value x₂, a third value x₃, and so on. After doing this a 
large number of times, calculate the sample average of the observed xᵢ’s. This 
average will typically be close to μ; a more rigorous version of this statement is 
provided by the Law of Large Numbers in Chap. 4. That is, μ can be interpreted as 
the long-run average value of X when the experiment is performed repeatedly. This 
interpretation is often appropriate for games of chance, where the “population” is 
not a concrete set of individuals but rather the results of all hypothetical future 
instances of playing the game. 


Example 2.18 A standard American roulette wheel has 38 spaces. Players bet on 
which space a marble will land in once the wheel has been spun. One of the simplest 
bets is based on the color of the space: 18 spaces are black, 18 are red, and 2 are 
green. So, if a player “bets on black,” s/he has an 18/38 chance of winning. Casinos 
consider color bets an “even wager,” meaning that a player who wagers $1 on black, 
say, will profit $1 if the marble lands in a black space (and lose the wagered $1 
otherwise). 
Let X = the return on a $1 wager on black. Then the pmf of X is 


x     | −1      +1 
p(x)  | 20/38   18/38 


and the expected value of X is E(X) = (−1)(20/38) + (1)(18/38) = −2/38 = −$.0526. 
If a player makes $1 bets on black on successive spins of the roulette wheel, in the 
long run s/he can expect to lose about 5.26 cents per wager. Since players don’t 
necessarily make a large number of wagers, this long-run average interpretation is 
perhaps more apt from the casino’s perspective: in the long run, they will gain an 
average of 5.26 cents for every $1 wagered on black at the roulette table. 
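
The long-run-average interpretation in Example 2.18 lends itself to simulation. The sketch below is one possible setup, not a prescribed method: it labels the 38 spaces 0 through 37 and arbitrarily treats spaces 0 through 17 as black (an assumption made only for bookkeeping), then averages the returns from many $1 bets. The seed is fixed purely so the run is reproducible.

```python
import random
from fractions import Fraction


def exact_expected_return():
    """E(X) = (-1)(20/38) + (1)(18/38) for a $1 bet on black."""
    return Fraction(-1) * Fraction(20, 38) + Fraction(1) * Fraction(18, 38)


def simulate_average_return(num_bets, seed=0):
    """Average return per $1 bet on black over num_bets simulated spins."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_bets):
        # Spaces 0-17 are treated as black (18 of the 38 equally likely spaces).
        total += 1 if rng.randrange(38) < 18 else -1
    return total / num_bets


print(float(exact_expected_return()))
for n in (100, 10_000, 200_000):
    print(n, simulate_average_return(n))
```

With only 100 bets the sample average can be far from −.0526 (and may even be positive), which illustrates why the long-run interpretation fits the casino better than an individual player.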


Thus far, we have assumed that the mean of any given distribution exists. If the 
set of possible values of X is unbounded, so that the sum for μ is actually an infinite 
series, the expected value of X might or might not exist (depending on whether the 
series converges or diverges). 


Example 2.19 From Example 2.10, the general form for the pmf of X = the 
number of children born up to and including the first boy is 


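$$
p(x) = \begin{cases} (1 - p)^{x - 1} p & x = 1, 2, 3, \ldots \\ 0 & \text{otherwise} \end{cases}
$$

The expected value of X therefore entails evaluating an infinite summation: 

$$
E(X) = \sum_{D} x \cdot p(x) = \sum_{x=1}^{\infty} x p (1 - p)^{x-1} = p \sum_{x=1}^{\infty} \left[ -\frac{d}{dp} (1 - p)^{x} \right] \tag{2.9}
$$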


If we interchange the order of taking the derivative and the summation in 
Eq. (2.9), the sum is that of a geometric series. (In particular, the infinite series 
converges for 0 < p < 1.) 

After the sum is computed and the derivative is taken, the final result is E(X) = 
1/p. That is, the expected number of children born up to and including the first boy 
is the reciprocal of the chance of getting a boy. This is actually quite intuitive: if p is 
near 1, we expect to see a boy very soon, whereas if p is near 0, we expect many 
births before the first boy. For p = .5, E(X) = 2. 

Exercise 48 at the end of this section presents an alternative method for computing 
the mean of this particular distribution. 
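
For readers who want the intermediate steps behind E(X) = 1/p, the following is one way to carry them out, assuming (as the example indicates) that the derivative and the summation in Eq. (2.9) may be interchanged for 0 < p < 1:

$$
\begin{aligned}
\sum_{x=1}^{\infty} (1 - p)^{x} &= \frac{1 - p}{1 - (1 - p)} = \frac{1 - p}{p} = \frac{1}{p} - 1, \\
E(X) &= p \cdot \left( -\frac{d}{dp} \right) \left[ \frac{1}{p} - 1 \right] = p \cdot \frac{1}{p^{2}} = \frac{1}{p}.
\end{aligned}
$$

With p = .5 this gives E(X) = 2, in agreement with the value stated in the example.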


Example 2.20 Let X, the number of interviews a student has prior to getting a job, 
have pmf 


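$$
p(x) = \begin{cases} k/x^{2} & x = 1, 2, 3, \ldots \\ 0 & \text{otherwise} \end{cases}
$$

where k is such that Σ (k/x²) = 1, the sum running over x = 1, 2, 3, .... (Because 
Σ (1/x²) = π²/6, the value of k is 6/π².) The expected value of X is 

$$
\mu = E(X) = \sum_{x=1}^{\infty} x \cdot \frac{k}{x^{2}} = k \sum_{x=1}^{\infty} \frac{1}{x} \tag{2.10}
$$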


The sum on the right of Eq. (2.10) is the famous harmonic series of mathematics 
and can be shown to diverge. E(X) is not finite here because p(x) does not decrease 
sufficiently fast as x increases; statisticians say that the probability distribution of 
X has “a heavy tail.” If a sequence of X values is chosen using this distribution, the 
sample average will not settle down to some finite number but will tend to grow 
without bound. ■
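To see the divergence in Example 2.20 concretely, one can tabulate partial sums of the series in Eq. (2.10). This is a small Python sketch under our own naming (`K`, `partial_mean`); it only evaluates finite partial sums and does not simulate the distribution.

```python
import math

K = 6 / math.pi ** 2  # normalizing constant k from Example 2.20


def partial_mean(n):
    """Sum of x * (K / x^2) = K / x over x = 1, ..., n."""
    return K * sum(1 / x for x in range(1, n + 1))


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        # Each hundredfold increase in n adds roughly K * ln(100) to the sum.
        print(n, round(partial_mean(n), 3))
```

The partial sums keep growing, roughly like K ln(n), rather than settling toward a limit, which is what it means for E(X) not to be finite.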


2.3.2 The Expected Value of a Function 


Often we will be interested in the expected value of some function h(X) rather than 
X itself. An easy way of computing the expected value of h(X) is suggested by the 
following example. 


Example 2.21 The cost of a certain vehicle diagnostic test depends on the number 
of cylinders X in the vehicle’s engine. Suppose the cost function is $h(X) = 20 + 3X 
+ .5X^2$. Since X is a random variable, so is Y = h(X). The pmf of X and the derived 
pmf of Y are as follows: 


x    | 4    6    8          y    | 40   56   76 
p(x) | .5   .3   .2         p(y) | .5   .3   .2 


With D* denoting possible values of Y, 

\[
\begin{aligned}
E(Y) = E[h(X)] &= \sum_{y \in D^*} y \cdot p(y) \\
&= (40)(.5) + (56)(.3) + (76)(.2) = \$52 \\
&= h(4) \cdot (.5) + h(6) \cdot (.3) + h(8) \cdot (.2) \\
&= \sum_{D} h(x) \cdot p(x)
\end{aligned} \qquad (2.11)
\]

According to Eq. (2.11), it was not necessary to determine the pmf of Y to obtain 
E(Y); instead, the desired expected value is a weighted average of the possible h(x) 
(rather than x) values. ■


PROPOSITION 
If the rv X has a set of possible values D and pmf p(x), then the expected value 
of any function h(X), denoted by E[h(X)] or $\mu_{h(X)}$, is computed by 

\[
E[h(X)] = \sum_{D} h(x) \cdot p(x)
\]


This is sometimes referred to as the Law of the Unconscious Statistician. 


According to this proposition, E[h(X)] is computed in the same way that E(X) 
itself is, except that h(x) is substituted in place of x. That is, E[h(X)] is a weighted 
average of possible h(X) values, where the weights are the probabilities of the 
corresponding original X values. 
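The two routes to E[h(X)] can be compared in a few lines of code. The sketch below is in Python (not the Matlab or R used elsewhere in the text) and uses the pmf and cost function of Example 2.21; the helper name `expect` is our own. It first builds the pmf of Y = h(X) explicitly and then applies the proposition directly to the pmf of X.

```python
pmf_x = {4: 0.5, 6: 0.3, 8: 0.2}  # pmf of X from Example 2.21


def h(x):
    """Cost function h(x) = 20 + 3x + .5x^2."""
    return 20 + 3 * x + 0.5 * x ** 2


def expect(pmf, func=lambda v: v):
    """Weighted average of func(value), weights given by the pmf."""
    return sum(func(v) * prob for v, prob in pmf.items())


# Route 1: derive the pmf of Y = h(X), then average the y values.
pmf_y = {}
for x, prob in pmf_x.items():
    pmf_y[h(x)] = pmf_y.get(h(x), 0.0) + prob
mean_via_y = expect(pmf_y)

# Route 2: average h(x) against the pmf of X, with no pmf of Y needed.
mean_via_x = expect(pmf_x, h)

if __name__ == "__main__":
    print(mean_via_y, mean_via_x)
```

Both routes give the same weighted average; the second simply skips the bookkeeping of constructing the pmf of Y.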


Example 2.22 A computer store has purchased three computers at $500 apiece. It 
will sell them for $1,000 apiece. The manufacturer has agreed to repurchase any 
computers still unsold after a specified period at $200 apiece. Let X denote 
the number of computers sold, and suppose that p(0) = .1, p(1) = .2, p(2) = .3, 
and p(3) = .4. With h(X) denoting the profit associated with selling X units, the 
given information implies that h(X) = revenue — cost = 1000X + 200(3 — X) — 
1500 = 800X — 900. The expected profit is then 


\[
\begin{aligned}
E[h(X)] &= h(0) \cdot p(0) + h(1) \cdot p(1) + h(2) \cdot p(2) + h(3) \cdot p(3) \\
&= (800(0) - 900)(.1) + (800(1) - 900)(.2) + (800(2) - 900)(.3) + (800(3) - 900)(.4) \\
&= (-900)(.1) + (-100)(.2) + (700)(.3) + (1500)(.4) \\
&= \$700
\end{aligned}
\]
■


Because an expected value is a sum, it possesses the same properties as any 
summation; specifically, the expected value “operator” can be distributed across 
addition and across multiplication by constants. This important property is known 
as linearity of expectation. 


LINEARITY OF EXPECTATION 
For any functions $h_1(X)$ and $h_2(X)$ and any constants $a_1$, $a_2$, and b, 

\[
E[a_1 h_1(X) + a_2 h_2(X) + b] = a_1 E[h_1(X)] + a_2 E[h_2(X)] + b
\]

In particular, for any linear function aX + b, 

\[
E(aX + b) = a \cdot E(X) + b \qquad (2.12)
\]

(or, using alternative notation, $\mu_{aX+b} = a \cdot \mu_X + b$). 

Proof Let $h(X) = a_1 h_1(X) + a_2 h_2(X) + b$, and apply the previous proposition: 

\[
\begin{aligned}
E[a_1 h_1(X) + a_2 h_2(X) + b] &= \sum_{D} \left( a_1 h_1(x) + a_2 h_2(x) + b \right) \cdot p(x) \\
&= a_1 \sum_{D} h_1(x) \cdot p(x) + a_2 \sum_{D} h_2(x) \cdot p(x) + b \sum_{D} p(x) && \text{distributive property of addition} \\
&= a_1 E[h_1(X)] + a_2 E[h_2(X)] + b \cdot 1 \\
&= a_1 E[h_1(X)] + a_2 E[h_2(X)] + b
\end{aligned}
\]

The special case of aX + b is obtained by setting $a_1 = a$, $h_1(X) = X$, and 
$a_2 = 0$. ■


By induction, linearity of expectation applies to any finite number of terms. 
In Example 2.21, it is easily computed that E(X) = 4(.5) + 6(.3) + 8(.2) = 5.4 and 
$E(X^2) = \sum x^2 \cdot p(x) = 4^2(.5) + 6^2(.3) + 8^2(.2) = 31.6$. Applying linearity of 
expectation to $Y = h(X) = 20 + 3X + .5X^2$, we obtain 

\[
\mu_Y = E[20 + 3X + .5X^2] = 20 + 3E(X) + .5E(X^2) = 20 + 3(5.4) + .5(31.6) = \$52,
\]


which matches the result of Example 2.21. 

The special case Eq. (2.12) states that the expected value of a linear function 
equals the linear function evaluated at the expected value E(X). Since h(X) in 
Example 2.22 is linear and E(X) = 2, E[h(X)] = 800(2) — 900 = $700, as before. 
Two special cases of Eq. (2.12) yield two important rules of expected value. 


1. For any constant a, $\mu_{aX} = a \cdot \mu_X$ (take b = 0). 
2. For any constant b, $\mu_{X+b} = \mu_X + b = E(X) + b$ (take a = 1). 


Multiplication of X by a constant a changes the unit of measurement (from 
dollars to cents, where a = 100, inches to cm, where a = 2.54, etc.). Rule 1 says that 
the expected value in the new units equals the expected value in the old units 
multiplied by the conversion factor a. Similarly, if the constant b is added to each 
possible value of X, then the expected value will be shifted by that same amount. 

One commonly made error is to substitute $\mu_X$ directly into the function h(X) when 
h is a nonlinear function, in which case Eq. (2.12) does not apply. Consider Example 
2.21: the mean of X is 5.4, and it’s tempting to infer that the mean of Y = h(X) 
is simply h(5.4). However, since the function $h(X) = 20 + 3X + .5X^2$ is not 
linear, this does not yield the correct answer: 

\[
h(5.4) = 20 + 3(5.4) + .5(5.4)^2 = 50.78 \neq 52 = \mu_Y
\]

In general, $\mu_{h(X)}$ does not equal $h(\mu_X)$ unless the function h(x) is linear. 
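The contrast between linearity of expectation and the plug-in error can be reproduced numerically. This Python sketch reuses the pmf and cost function of Example 2.21; the names `expect`, `by_linearity`, and `plug_in` are our own labels for the three quantities being compared.

```python
pmf = {4: 0.5, 6: 0.3, 8: 0.2}  # pmf of X from Example 2.21


def h(x):
    """Nonlinear cost function h(x) = 20 + 3x + .5x^2."""
    return 20 + 3 * x + 0.5 * x ** 2


def expect(func):
    """E[func(X)] as a weighted average over the pmf of X."""
    return sum(func(x) * prob for x, prob in pmf.items())


mean_x = expect(lambda x: x)              # E(X)
mean_square_x = expect(lambda x: x ** 2)  # E(X^2)

direct = expect(h)                                   # E[h(X)] from the proposition
by_linearity = 20 + 3 * mean_x + 0.5 * mean_square_x  # linearity of expectation
plug_in = h(mean_x)                                  # h(E(X)), the common error

if __name__ == "__main__":
    print(direct, by_linearity, plug_in)
```

The first two quantities agree, while the plug-in value falls short because h is not linear.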


2.3.3 The Variance and Standard Deviation of X 


The expected value of X describes where the probability distribution is centered. 
Using the physical analogy of placing point mass p(x) at the value x on a 
one-dimensional axis, if the axis were then supported by a fulcrum placed at $\mu$, 
there would be no tendency for the axis to tilt. This is illustrated for two different 
distributions in Fig. 2.7. 


Fig. 2.7 Two different probability distributions with $\mu = 4$ (panels a and b, each plotting p(x)) 


Although both distributions pictured in Fig. 2.7 have the same mean/fulcrum $\mu$, 
the distribution of Fig. 2.7b has greater spread or variability or dispersion than does 
that of Fig. 2.7a. Our goal now is to obtain a quantitative assessment of the extent to 
which the distribution spreads out about its mean value. 


DEFINITION 
Let X have pmf p(x) and expected value $\mu$. Then the variance of X, denoted 
by Var(X) or $\sigma_X^2$ or just $\sigma^2$, is 

\[
\mathrm{Var}(X) = \sum_{D} \left[ (x - \mu)^2 \cdot p(x) \right] = E[(X - \mu)^2]
\]

The standard deviation (SD) of X, denoted by SD(X) or $\sigma_X$ or just $\sigma$, is 

\[
\sigma_X = \sqrt{\mathrm{Var}(X)}
\]


The quantity $h(X) = (X - \mu)^2$ is the squared deviation of X from its mean, and 
$\sigma^2$ is the expected squared deviation, i.e., a weighted average of the squared 
deviations from $\mu$. Taking the square root of the variance to obtain standard 
deviation returns us to the original units of the variable, e.g., if X is measured in 
dollars, then both $\mu$ and $\sigma$ also have units of dollars. If most of the probability 
distribution is close to $\mu$, as in Fig. 2.7a, then $\sigma$ will typically be relatively small. 
However, if there are x values far from $\mu$ that have large probabilities (as in 
Fig. 2.7b), then $\sigma$ will be larger. 


Example 2.23 Consider again the distribution of the Apgar score X of a randomly 
selected newborn described in Example 2.16. The mean value of X was calculated 
as $\mu = 7.15$, so 

\[
\mathrm{Var}(X) = \sigma^2 = \sum_{x=0}^{10} (x - 7.15)^2 \cdot p(x) = (0 - 7.15)^2(.002) + \cdots + (10 - 7.15)^2(.01) = 1.5815
\]

The standard deviation of X is $SD(X) = \sigma = \sqrt{1.5815} = 1.26$. ■


A rough interpretation of $\sigma$ is that its value gives the size of a typical or 
representative distance from $\mu$ (hence, “standard deviation”). Because $\sigma = 1.26$ in 
the preceding example, we can say that some of the possible X values differ by more 
than 1.26 from the mean value 7.15 whereas other possible X values are closer than 
this to 7.15; roughly, 1.26 is the size of a typical deviation from the mean Apgar 
score. 


Example 2.24 (Example 2.18 continued) The variance of X = the return on a $1 
bet on black is 


\[
\sigma_X^2 = (-1 - (-2/38))^2 \cdot (20/38) + (1 - (-2/38))^2 \cdot (18/38) = 0.99723
\]

and the standard deviation is $\sigma_X = \sqrt{0.99723} = 0.9986 \approx \$1$. The two possible 
values of X are −$1 and +$1; since betting on black is almost a break-even wager 
(the mean is quite close to 0), the typical difference between an actual return X and 
the average return $\mu_X$ is roughly one dollar. ■
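The arithmetic of Example 2.24 can be carried out exactly with rational numbers. The following Python sketch is one way to do so, using the standard `fractions` module; the variable names are our own, and the pmf is the one stated in the example (lose $1 with probability 20/38, win $1 with probability 18/38).

```python
from fractions import Fraction
from math import sqrt

# Return on a $1 bet on black.
pmf = {-1: Fraction(20, 38), 1: Fraction(18, 38)}

mean = sum(x * p for x, p in pmf.items())
# Variance from the definition: expected squared deviation from the mean.
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
sd = sqrt(variance)

if __name__ == "__main__":
    print(mean, variance, round(sd, 4))
```

Working with fractions shows that the variance is exactly 360/361, which is why the standard deviation is so close to one dollar.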


A natural probability question arises: how often does X fall within this “typical 
distance of the mean”? That is, what’s the chance that a rv X lies between $\mu_X - \sigma_X$ 
and $\mu_X + \sigma_X$? What about the likelihood that X is within two standard deviations of 
its mean? There are no universal answers: for different pmfs, varying amounts of 
probability may lie within one (or two or three) standard deviation(s) of the 
expected value. That said, the following theorem, due to Russian mathematician 
Pafnuty Chebyshev, partially addresses questions of this sort. 


CHEBYSHEV’S INEQUALITY 
Let X be a discrete rv with mean $\mu$ and standard deviation $\sigma$. Then, for any $k \geq 1$, 

\[
P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}
\]

That is, the probability X is at least k standard deviations away from its mean 
is at most $1/k^2$. 

An equivalent statement to Chebyshev’s inequality is that every random variable 
has a probability of at least $1 - 1/k^2$ to fall within k standard deviations of its mean. 


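Proof Let A denote the event $|X - \mu| \geq k\sigma$; or, equivalently, the set of values 
$\{x : |x - \mu| \geq k\sigma\}$. Begin by writing out the definition of Var(X): 

\[
\begin{aligned}
\mathrm{Var}(X) &= \sum_{D} \left[ (x - \mu)^2 \cdot p(x) \right] \\
&= \sum_{A} \left[ (x - \mu)^2 \cdot p(x) \right] + \sum_{A'} \left[ (x - \mu)^2 \cdot p(x) \right] \\
&\geq \sum_{A} \left[ (x - \mu)^2 \cdot p(x) \right] && \text{because the discarded term is } \geq 0 \\
&\geq \sum_{A} \left[ (k\sigma)^2 \cdot p(x) \right] && \text{because } (x - \mu)^2 \geq (k\sigma)^2 \text{ on the set } A \\
&= (k\sigma)^2 \sum_{A} p(x) = (k\sigma)^2 P(A) = k^2 \sigma^2 P(|X - \mu| \geq k\sigma)
\end{aligned}
\]

The Var(X) term on the left-hand side is the same as the $\sigma^2$ term on the right-hand 
side; cancelling the two, we are left with $1 \geq k^2 P(|X - \mu| \geq k\sigma)$, and 
Chebyshev’s inequality follows. ■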


For k = 1, Chebyshev’s inequality states that $P(|X - \mu| \geq \sigma) \leq 1$, which isn’t 
very informative since all probabilities are bounded above by 1. In fact, 
distributions can be constructed for which 100% of the distribution is at least 
1 standard deviation from the mean, so that the rv X has probability 0 of falling 
less than one standard deviation from its mean (see Exercise 47). Substituting k = 
2, Chebyshev’s inequality states that the chance any rv is at least 2 standard 
deviations from its mean cannot exceed $1/2^2 = .25 = 25\%$. Equivalently, every 
distribution has the property that at least 75% of its “mass” lies within 2 standard 
deviations of its mean value (in fact, for many distributions, the exact probability 
is more). 
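Chebyshev’s bound can be compared with exact tail probabilities for any specific pmf. The Python sketch below does this for the computer-sales pmf of Example 2.22, chosen here only as a convenient illustration; the function name `tail_prob` is our own. Deviations are compared in squared form so that the comparison with $k\sigma$ stays exact.

```python
from fractions import Fraction

# pmf of X = number of computers sold (Example 2.22).
pmf = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(4, 10)}

mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())


def tail_prob(k):
    """P(|X - mean| >= k * sigma), checked via (x - mean)^2 >= k^2 * variance."""
    return sum(p for x, p in pmf.items() if (x - mean) ** 2 >= k ** 2 * variance)


if __name__ == "__main__":
    for k in (1, 2, 3):
        # Exact tail probability next to the Chebyshev bound 1/k^2.
        print(k, tail_prob(k), Fraction(1, k ** 2))
```

For this pmf the exact tail probabilities are well below the bound for k = 2 and k = 3, consistent with the remark that the bound is often conservative.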


2.3.4 Properties of Variance 


An alternative to the defining formula for Var(X) reduces the computational burden. 


PROPOSITION 
\[
\mathrm{Var}(X) = \sigma^2 = E(X^2) - \mu^2
\]

This equation is referred to as the variance shortcut formula. 

In using this formula, $E(X^2)$ is computed first without any subtraction; then $\mu$ is 
computed, squared, and subtracted (once) from $E(X^2)$. This formula is more efficient 
because it entails only one subtraction, and $E(X^2)$ does not require calculating 
squared deviations from $\mu$. 


Example 2.25 Referring back to the Apgar score scenario of Examples 2.16 and 2.23, 


\[
E(X^2) = \sum_{x=0}^{10} x^2 \cdot p(x) = (0^2)(.002) + (1^2)(.001) + \cdots + (10^2)(.01) = 52.704
\]

Thus, $\sigma^2 = 52.704 - (7.15)^2 = 1.5815$ as before, and again $\sigma = 1.26$. ■


Proof of the Variance Shortcut Formula Expand $(X - \mu)^2$ in the definition of 
Var(X), and then apply linearity of expectation: 

\[
\begin{aligned}
\mathrm{Var}(X) = E[(X - \mu)^2] &= E[X^2 - 2\mu X + \mu^2] \\
&= E(X^2) - 2\mu E(X) + \mu^2 && \text{by linearity of expectation} \\
&= E(X^2) - 2\mu \cdot \mu + \mu^2 = E(X^2) - 2\mu^2 + \mu^2 \\
&= E(X^2) - \mu^2
\end{aligned}
\]
■


The quantity $E(X^2)$ in the variance shortcut formula is called the mean-square 
value of the random variable X. Engineers may be familiar with the root-mean-square, 
or RMS, which is the square root of $E(X^2)$. Do not confuse this with the square 
of the mean of X, i.e., $\mu^2$! For example, if X has a mean of 7.15, the mean-square value 
of X is not $(7.15)^2$, because $h(x) = x^2$ is not linear. (In Example 2.25, the mean-square 
value of X is 52.704.) It helps to look at the two formulas side-by-side: 

\[
E(X^2) = \sum_{D} x^2 \cdot p(x) \quad \text{versus} \quad \mu^2 = \left( \sum_{D} x \cdot p(x) \right)^2
\]


The order of operations is clearly different. In fact, it can be shown (see Exercise 
46) that $E(X^2) \geq \mu^2$ for every random variable, with equality if and only if X is 
constant. 
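The shortcut formula and the defining formula can be checked against each other on a small pmf. This Python sketch uses the pmf of X from Example 2.21 as an illustration and exact fractions to avoid rounding; the variable names are our own.

```python
from fractions import Fraction

# pmf of X = number of cylinders (Example 2.21).
pmf = {4: Fraction(5, 10), 6: Fraction(3, 10), 8: Fraction(2, 10)}

mean = sum(x * p for x, p in pmf.items())              # mu
mean_square = sum(x ** 2 * p for x, p in pmf.items())  # E(X^2)

# Defining formula: expected squared deviation from the mean.
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())
# Shortcut formula: mean-square value minus the square of the mean.
var_shortcut = mean_square - mean ** 2

if __name__ == "__main__":
    print(mean, mean_square, mean ** 2, var_definition, var_shortcut)
```

The two variance computations agree, and the mean-square value exceeds the square of the mean by exactly the variance.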

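The variance of a function h(X) is the expected value of the squared difference 
between h(X) and its expected value: 

\[
\begin{aligned}
\mathrm{Var}[h(X)] = \sigma_{h(X)}^2 &= \sum_{D} \left[ \left( h(x) - \mu_{h(X)} \right)^2 \cdot p(x) \right] \\
&= \left[ \sum_{D} h^2(x) \cdot p(x) \right] - \left[ \sum_{D} h(x) \cdot p(x) \right]^2
\end{aligned}
\]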


When h(x) is a linear function, Var[h(X)] has a much simpler expression (see 
Exercise 43 for a proof). 


PROPOSITION 

\[
\mathrm{Var}(aX + b) = \sigma_{aX+b}^2 = a^2 \cdot \sigma_X^2 \quad \text{and} \quad \sigma_{aX+b} = |a| \cdot \sigma_X \qquad (2.13)
\]

In particular, 

\[
\sigma_{aX} = |a| \cdot \sigma_X \quad \text{and} \quad \sigma_{X+b} = \sigma_X
\]


The absolute value is necessary because a might be negative, yet a standard 
deviation cannot be. Usually multiplication by a corresponds to a change in the unit 
of measurement (e.g., kg to lb or dollars to euros); the SD in the new unit is just the 
original SD multiplied by the conversion factor. On the other hand, the addition of 
the constant b does not affect the variance, which is intuitive, because the addition 
of b changes the location (mean value) but not the spread of values. Together, 
Eqs. (2.12) and (2.13) comprise the rescaling properties of mean and standard 
deviation. 


Example 2.26 In the computer sales scenario of Example 2.22, E(X) = 2 and 


\[
E(X^2) = (0^2)(.1) + (1^2)(.2) + (2^2)(.3) + (3^2)(.4) = 5
\]

so $\mathrm{Var}(X) = 5 - (2)^2 = 1$. The profit function Y = h(X) = 800X − 900 is linear, so 
Eq. (2.13) applies with a = 800 and b = −900. Hence Y has variance 
$a^2 \sigma_X^2 = (800)^2(1) = 640{,}000$ and standard deviation $800. ■
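The rescaling properties can be confirmed by brute force: build the pmf of Y = aX + b and compute its mean and variance from scratch. This Python sketch does so for the profit function of Example 2.26; the helper names `mean` and `variance` and the constants `A` and `B` are our own.

```python
from fractions import Fraction
from math import sqrt

# pmf of X = number of computers sold (Examples 2.22 and 2.26).
pmf_x = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(4, 10)}
A, B = 800, -900  # profit Y = 800X - 900


def mean(pmf):
    return sum(v * p for v, p in pmf.items())


def variance(pmf):
    m = mean(pmf)
    return sum((v - m) ** 2 * p for v, p in pmf.items())


# pmf of Y: each x value maps to a distinct y value because A is nonzero.
pmf_y = {A * x + B: p for x, p in pmf_x.items()}

if __name__ == "__main__":
    print(mean(pmf_y), A * mean(pmf_x) + B)        # Eq. (2.12)
    print(variance(pmf_y), A ** 2 * variance(pmf_x))  # Eq. (2.13)
    print(sqrt(variance(pmf_y)), abs(A) * sqrt(variance(pmf_x)))
```

The shift b changes the mean but drops out of the variance, while the scale factor enters the variance as its square.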


2.3.5 Exercises: Section 2.3 (29-48) 


29. The pmf of the amount of memory X (GB) in a purchased flash drive was 
given in Example 2.11 as 


x    | 1    2    4    8    16 
p(x) | .05  .10  .35  .40  .10 


(a) Compute and interpret E(X). 

(b) Compute Var(X) directly from the definition. 
(c) Obtain and interpret the standard deviation of X. 
(d) Compute Var(X) using the shortcut formula. 

30. An individual who has automobile insurance from a company is randomly 
selected. Let Y be the number of moving violations for which the individual 
was cited during the last 3 years. The pmf of Y is 


y    | 0    1    2    3 
p(y) | .60  .25  .10  .05 


(a) Compute E(Y). 

(b) Suppose an individual with Y violations incurs a surcharge of $\$100Y^2$. 
Calculate the expected amount of the surcharge. 

31. Refer to Exercise 12 and calculate Var(Y) and $\sigma_Y$. Then determine the 
probability that Y is within 1 standard deviation of its mean value. 

32. An appliance dealer sells three different models of upright freezers having 
13.5, 15.9, and 19.1 cubic feet of storage space, respectively. Let X = the 
amount of storage space purchased by the next customer to buy a freezer. 
Suppose that X has pmf 


x    | 13.5  15.9  19.1 
p(x) | .2    .5    .3 


(a) Compute E(X), $E(X^2)$, and Var(X). 

(b) If the price of a freezer having capacity X cubic feet is 17X + 180, what is 
the expected price paid by the next customer to buy a freezer? 

(c) What is the standard deviation of the price 17X + 180 paid by the next 
customer? 

(d) Suppose that although the rated capacity of a freezer is X, the actual 
capacity is $h(X) = X - .01X^2$. What is the expected actual capacity of 
the freezer purchased by the next customer? 

33. Let X be a Bernoulli rv with pmf as in Example 2.17. 

(a) Compute $E(X^2)$. 

(b) Show that Var(X) = p(1 − p). 

(c) Compute E(X”’). 

34. Suppose that the number of plants of a particular type found in a rectangular 
sampling region (called a quadrat by ecologists) in a certain geographic area is 
an rv X with pmf 


Gs A, cae oo a ee 
P 0 otherwise 


Is E(X) finite? Justify your answer. (This is another distribution that 
statisticians would call heavy-tailed.) 

35. A small market orders copies of a certain magazine for its magazine rack each 
week. Let X = demand for the magazine, with pmf 


1 
PO) Ts 35 15 15 15 15 

Suppose the store owner actually pays $2.00 for each copy of the magazine 
and the price to customers is $4.00. If magazines left at the end of the week 
have no salvage value, is it better to order three or four copies of the 
magazine? [Hint: For both three and four copies ordered, express net revenue 
as a function of demand X, and then compute the expected revenue.] 

36. Let X be the damage incurred (in $) in a certain type of accident during a given 
year. Possible X values are 0, 1000, 5000, and 10,000, with probabilities .8, .1, 
.08, and .02, respectively. A particular company offers a $500 deductible 
policy. If the company wishes its expected profit to be $100, what premium 
amount should it charge? 

37. The n candidates for a job have been ranked 1, 2, 3, ..., n. Let X = the rank of a 
randomly selected candidate, so that X has pmf 

\[
p(x) = \begin{cases} 1/n & x = 1, 2, 3, \ldots, n \\ 0 & \text{otherwise} \end{cases}
\]

(this is called the discrete uniform distribution). Compute E(X) and Var(X) 
using the shortcut formula. [Hint: The sum of the first n positive integers is 
n(n + 1)/2, whereas the sum of their squares is n(n + 1)(2n + 1)/6.] 

38. Let X = the outcome when a fair die is rolled once. If before the die is rolled 
you are offered either $100 or h(X) = 350/X dollars, would you accept 
the guaranteed amount or would you gamble? [Hint: Determine E[h(X)], but be 
careful: the mean of 350/X is not $350/\mu$.] 

39. In the popular game Plinko on The Price Is Right, contestants drop a circular 
disk (a “chip”) down a pegged board; the chip bounces down the board and 
lands in a slot corresponding to one of five dollar amounts. The random variable 
X = winnings from one chip dropped from the middle slot has roughly the 
following distribution. 


x    | $0   $100  $500  $1000  $10,000 
p(x) | .39  .03   .11   .24    .23 


(a) Graph the probability mass function of X. 

(b) What is the probability a contestant makes money on a chip? 

(c) What is the probability a contestant makes at least $1000 on a chip? 

(d) Determine the expected winnings. Interpret this number. 

(e) Determine the corresponding standard deviation. 

40. A supply company currently has in stock 500 lb of fertilizer, which it sells to 
customers in 10-lb bags. Let X equal the number of bags purchased by a 
randomly selected customer. Sales data shows that X has the following pmf: 


x    | 1  2  3  4 
p(x) | .2 


(a) Compute the average number of bags bought per customer. 

(b) Determine the standard deviation for the number of bags bought per 
customer. 

(c) Define Y to be the amount of fertilizer left in stock, in pounds, after the first 
customer. Construct the pmf of Y. 

(d) Use the pmf of Y to find the expected amount of fertilizer left in stock, in 
pounds, after the first customer. 

(e) Write Y as a linear function of X. Then use rescaling properties to find the 
mean and standard deviation of Y. 

(f) The supply company offers a discount to each customer based on the 
formula $W = (X - 1)^2$. Determine the expected discount for a customer. 

(g) Does your answer in part (f) equal $(\mu_X - 1)^2$? Why or why not? 

(h) Calculate the standard deviation of W. 

41. Refer back to the roulette scenario in Examples 2.18 and 2.24. Two other ways 
to wager at roulette are betting on a single number, or on a four-number 
“square.” The pmfs for the returns on a $1 wager on a number and a square 
are displayed below. (Payoffs for winning are always based on the odds of 
losing a wager under the assumption the two green spaces didn’t exist.) 

Single number: 

x    | −$1    +$35 
p(x) | 37/38  1/38 

Square: 

x    | −$1    +$8 
p(x) | 34/38  4/38 


(a) Determine the expected return from a $1 wager on a single number, and 
then on a square. 

(b) Compare your answers from (a) to Example 2.18. What can be said about 
the expected return for a $1 wager? Based on this, does expected return 
reflect most players’ intuition that betting on black is “safer” and betting on 
a single number is “riskier”? 

(c) Now calculate the standard deviations for the two pmfs above. 

(d) How do the standard deviations of the three betting schemes (color, single 
number, square) compare? How do these values appear to relate to players’ 
intuitive sense of risk? 


42. (a) Draw a line graph of the pmf of X in Exercise 35. Then determine the pmf 
of −X and draw its line graph. From these two pictures, what can you say 
about Var(X) and Var(−X)? 

(b) Use the proposition involving Var(aX + b) to establish a general relationship 
between Var(X) and Var(−X). 

43. Use the definition of variance to prove that $\mathrm{Var}(aX + b) = a^2 \sigma_X^2$. [Hint: From 
Eq. (2.12), $\mu_{aX+b} = a\mu_X + b$.] 

44. Suppose E(X) = 5 and E[X(X − 1)] = 27.5. 

(a) Determine $E(X^2)$. [Hint: $E[X(X - 1)] = E(X^2 - X) = E(X^2) - E(X)$.] 

(b) What is Var(X)? 

(c) What is the general relationship among the quantities E(X), E[X(X − 1)], 
and Var(X)? 

45. Write a general rule for E(X − c) where c is a constant. What happens when you 
let $c = \mu$, the expected value of X? 

46. Let X be a rv with mean $\mu$. Show that $E(X^2) \geq \mu^2$ and that $E(X^2) > \mu^2$ unless 
X is a constant. [Hint: Consider variance.] 

47. Refer to Chebyshev’s inequality in this section. 

(a) What is the value of the upper bound for k = 2? k = 3? k = 4? k = 5? 
k = 10? 

(b) Compute $\mu$ and $\sigma$ for the distribution of Exercise 13. Then evaluate 
$P(|X - \mu| \geq k\sigma)$ for the values of k given in part (a). What does this suggest 
about the upper bound relative to the corresponding probability? 

(c) Suppose you will win $d if a fair coin flips heads and lose $d if it lands tails. 
Let X be the amount you get from a single coin flip. Compute E(X) and 
SD(X). What is the probability X will be less than one standard deviation 
from its mean value? 

(d) Let X have three possible values, −1, 0, and 1, with probabilities ip 8, and {ik 
respectively. What is $P(|X - \mu| \geq 3\sigma)$, and how does it compare to the 
corresponding Chebyshev bound? 

(e) Give a distribution for which $P(|X - \mu| \geq 5\sigma) = .04$. 

48. For a discrete rv X taking values in {0, 1, 2, 3, ...}, we shall derive the 
following alternative formula for the mean: 

\[
\mu_X = \sum_{x=0}^{\infty} \left[ 1 - F(x) \right]
\]

(a) Suppose for now the range of X is {0, 1, ..., N} for some positive integer N. 
By regrouping terms, show that 

\[
\begin{aligned}
\sum_{x=0}^{N} \left[ x \cdot p(x) \right] = {} & p(1) + p(2) + p(3) + \cdots + p(N) \\
& + p(2) + p(3) + \cdots + p(N) \\
& + p(3) + \cdots + p(N) \\
& \;\vdots \\
& + p(N)
\end{aligned}
\]

(b) Rewrite each row in the above expression in terms of the cdf of X, and use 
this to establish that 

\[
\sum_{x=0}^{N} \left[ x \cdot p(x) \right] = \sum_{x=0}^{N-1} \left[ 1 - F(x) \right]
\]

(c) Let $N \to \infty$ in part (b) to establish the desired result, and explain why the 
resulting formula works even if the maximum value of X is finite. [Hint: If 
the largest possible value of X is N, what does 1 − F(x) equal for x ≥ N?] 
(This derivation also implies that a discrete rv X has a finite mean iff the 
series $\sum [1 - F(x)]$ converges.) 

(d) Let X have the pmf from Examples 2.10 and 2.19. Use the cdf of X and the 
alternative mean formula just derived to determine $\mu_X$. 


2.4 The Binomial Distribution 


Many experiments conform either exactly or approximately to the following list of 
requirements: 


1. The experiment consists of a sequence of n smaller experiments called trials, 
   where n is fixed in advance of the experiment. 

2. Each trial can result in one of the same two possible outcomes (dichotomous 
   trials), which we denote by success (S) or failure (F). 

3. The trials are independent, so that the outcome on any particular trial does not 
   influence the outcome on any other trial. 

4. The probability of success is constant from trial to trial (homogeneous trials); we 
   denote this probability by p. 


DEFINITION 

An experiment for which Conditions 1-4 are satisfied—a fixed number of 
dichotomous, independent, homogeneous trials—is called a binomial 
experiment. 


Example 2.27 The same coin is tossed successively and independently n times. 
We arbitrarily use S to denote the outcome H (heads) and F to denote the outcome 
T (tails). Then this experiment satisfies Conditions 1-4. Tossing a thumbtack 
n times, with S = point up and F = point down, also results in a binomial 
experiment. ■


Some experiments involve a sequence of independent trials for which there are 
more than two possible outcomes on any one trial. A binomial experiment can then 
be created by dividing the possible outcomes into two groups. 


Example 2.28 The color of pea seeds is determined by a single genetic locus. If the 
two alleles at this locus are AA or Aa (the genotype), then the pea will be yellow 
(the phenotype), and if the alleles are aa, the pea will be green. Suppose we pair off 
20 Aa seeds and cross the two seeds in each of the ten pairs to obtain ten new 
genotypes. Call each new genotype a success S if it is aa and a failure otherwise. 
Then with this identification of S and F, the experiment is binomial with n = 10 and 
p = P(aa genotype). If each member of the pair is equally likely to contribute a or A, 
then p = P(a) · P(a) = (1/2)(1/2) = .25. ■


Example 2.29 A student acquaintance of yours has an iPod playlist containing 
50 songs, of which 35 were recorded prior to the year 2010 and the other 15 were 
recorded more recently. Suppose the random play function is used to select five 
from among these 50 songs for listening during a walk between classes. Each 
selection of a song constitutes a trial; regard a trial as a success if the selected 
song was recorded before 2010. Then 


\[
P(S \text{ on first trial}) = \frac{35}{50} = .70
\]

and 

\[
\begin{aligned}
P(S \text{ on second trial}) &= P(SS) + P(FS) \\
&= P(\text{second } S \mid \text{first } S)\,P(\text{first } S) + P(\text{second } S \mid \text{first } F)\,P(\text{first } F) \\
&= \frac{34}{49} \cdot \frac{35}{50} + \frac{35}{49} \cdot \frac{15}{50} = \frac{35}{50} \left( \frac{34}{49} + \frac{15}{49} \right) = \frac{35}{50} = .70
\end{aligned}
\]

Similarly, it can be shown that P(S on ith trial) = .70 for i = 3, 4, 5, so the trials 
are homogeneous. However, 

\[
P(S \text{ on fifth trial} \mid SSSS) = \frac{31}{46} = .67
\]

whereas 

\[
P(S \text{ on fifth trial} \mid FFFF) = \frac{35}{46} = .76
\]

The experiment is not binomial because the trials are not independent. In 
general, if sampling is without replacement, the experiment will not yield independent 
trials. If songs had been selected with replacement, then trials would have been 
independent, but this might have resulted in the same song being listened to more 
than once. ■


Example 2.30 Suppose a state has 500,000 licensed drivers, of whom 400,000 are 
insured. A sample of 10 drivers is chosen without replacement. The ith trial is 
labeled S if the ith driver chosen is insured. Although this situation would seem 
identical to that of Example 2.29, the important difference is that the size of the 
population being sampled is very large relative to the sample size. In this case 

\[
P(S \text{ on } 2 \mid S \text{ on } 1) = \frac{399{,}999}{499{,}999} = .80000
\]

and 

\[
P(S \text{ on } 10 \mid S \text{ on first } 9) = \frac{399{,}991}{499{,}991} = .799996 \approx .80000
\]


These calculations suggest that although the trials are not exactly independent, 
the conditional probabilities differ so slightly from one another that for practical 
purposes the trials can be regarded as independent with constant P(S) = .8. Thus, to 
a very good approximation, the experiment is binomial with n = 10 and p = .8. 


We will use the following convention in deciding whether a “without-replace- 
ment” experiment can be treated as being (approximately) binomial. 


RULE 

Consider sampling without replacement from a dichotomous population of 
size N. If the sample size (number of trials) n is at most 5% of the population 
size, the experiment can be analyzed as though it were exactly a binomial 
experiment. 


By “analyzed,” we mean that probabilities based on the binomial experiment 
assumptions will be quite close to the actual “without-replacement” probabilities, 
which are typically more difficult to calculate. In Example 2.29, n/N = 5/50 = .1 > 
.05, so the binomial experiment is not a good approximation, but in Example 2.30, 
n/N = 10/500,000 < .05. 
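The quality of the binomial approximation in the two examples can be examined numerically. The Python sketch below goes slightly beyond the text: it computes exact without-replacement probabilities by counting samples (a counting argument we are assuming here, since the text has not yet given a formula for these probabilities) and compares them with binomial probabilities. The function names and the summary measure `max_gap` are our own.

```python
from math import comb


def without_replacement(x, n, successes, population):
    """Exact P(x successes in n draws without replacement), by counting samples."""
    failures = population - successes
    return comb(successes, x) * comb(failures, n - x) / comb(population, n)


def binomial(x, n, p):
    """Binomial probability of x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def max_gap(n, successes, population):
    """Largest absolute difference between the two pmfs over x = 0, ..., n."""
    p = successes / population
    return max(
        abs(without_replacement(x, n, successes, population) - binomial(x, n, p))
        for x in range(n + 1)
    )


if __name__ == "__main__":
    print("Example 2.29 (n/N = .1):", max_gap(5, 35, 50))
    print("Example 2.30 (n/N = .00002):", max_gap(10, 400_000, 500_000))
```

Under these assumptions the gap for the playlist example is visible in the second decimal place, while the gap for the driver example is negligible, in line with the 5% rule.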


2.4.1 The Binomial Random Variable and Distribution 


In most binomial experiments, it is the total number of successes, rather than 
knowledge of exactly which trials yielded successes, that is of interest. 


DEFINITION 
Given a binomial experiment consisting of n trials, the binomial random 
variable X associated with this experiment is defined as 

X = the number of successes among the n trials 


Suppose, for example, that n = 3. Then there are eight possible outcomes for the 
experiment: 
SSS SSF SFS SFF FSS FSF FFS FFF 


From the definition of X, X(SSF) = 2, X(SFF) = 1, and so on. Possible values for 
X in an n-trial experiment are x = 0, 1, 2, ..., n. 


NOTATION 

We will write X ~ Bin(n, p) to indicate that X is a binomial rv based on n trials 
with success probability p. Because the pmf of a binomial rv X depends on the 
two parameters n and p, we denote the pmf by b(x; n, p). 


Our next goal is to derive a formula for the binomial pmf. Consider first the case 
n = 4 for which each outcome, its probability, and corresponding x value are listed 
in Table 2.1. For example, 


\[
\begin{aligned}
P(SSFS) &= P(S) \cdot P(S) \cdot P(F) \cdot P(S) && \text{independent trials} \\
&= p \cdot p \cdot (1 - p) \cdot p && \text{constant } P(S) \\
&= p^3 \cdot (1 - p)
\end{aligned}
\]

In this special case, we wish to determine b(x; 4, p) for x = 0, 1, 2, 3, and 4. For 
b(3; 4, p), we identify which of the 16 outcomes yield an x value of 3 and sum the 
probabilities associated with each such outcome: 

\[
b(3; 4, p) = P(FSSS) + P(SFSS) + P(SSFS) + P(SSSF) = 4p^3(1 - p)
\]

There are four outcomes with x = 3 and each has probability $p^3(1 - p)$; the 
probability depends only on the number of S’s, not the order of S’s and F’s. So 

\[
b(3; 4, p) = \left\{ \begin{array}{c} \text{number of outcomes} \\ \text{with } X = 3 \end{array} \right\} \cdot \left\{ \begin{array}{c} \text{probability of any particular} \\ \text{outcome with } X = 3 \end{array} \right\}
\]

Table 2.1 Outcomes and probabilities for a binomial experiment with four trials 

Outcome | x | Probability      || Outcome | x | Probability 
SSSS    | 4 | $p^4$            || FSSS    | 3 | $p^3(1-p)$ 
SSSF    | 3 | $p^3(1-p)$       || FSSF    | 2 | $p^2(1-p)^2$ 
SSFS    | 3 | $p^3(1-p)$       || FSFS    | 2 | $p^2(1-p)^2$ 
SSFF    | 2 | $p^2(1-p)^2$     || FSFF    | 1 | $p(1-p)^3$ 
SFSS    | 3 | $p^3(1-p)$       || FFSS    | 2 | $p^2(1-p)^2$ 
SFSF    | 2 | $p^2(1-p)^2$     || FFSF    | 1 | $p(1-p)^3$ 
SFFS    | 2 | $p^2(1-p)^2$     || FFFS    | 1 | $p(1-p)^3$ 
SFFF    | 1 | $p(1-p)^3$       || FFFF    | 0 | $(1-p)^4$ 


Similarly, $b(2; 4, p) = 6p^2(1 - p)^2$, which is also the product of the number of 
outcomes with X = 2 and the probability of any such outcome. 
In general, 

\[
b(x; n, p) = \left\{ \begin{array}{c} \text{number of sequences of} \\ \text{length } n \text{ consisting of } x\ S\text{'s} \end{array} \right\} \cdot \left\{ \begin{array}{c} \text{probability of any} \\ \text{particular such sequence} \end{array} \right\}
\]

Since the ordering of S’s and F’s is not important, the second factor in the 
previous equation is $p^x(1 - p)^{n-x}$ (for example, the first x trials resulting in S and the 
last n − x resulting in F). The first factor is the number of ways of choosing x of the 
n trials to be S’s—that is, the number of combinations of size x that can be 
constructed from n distinct objects (trials here). 


THEOREM 


\[
b(x; n, p) = \begin{cases} \dbinom{n}{x} p^x (1 - p)^{n-x} & x = 0, 1, 2, \ldots, n \\ 0 & \text{otherwise} \end{cases}
\]


Example 2.31 Each of six randomly selected cola drinkers is given a glass 
containing cola S and one containing cola F. The glasses are identical in 
appearance except for a code on the bottom to identify the cola. Suppose there 
is actually no tendency among cola drinkers to prefer one cola to the other. Then 
p = P(a selected individual prefers S$) = .5, so with X = the number among the six 
who prefer S, X ~ Bin(6, .5). 

Thus 


$$P(X = 3) = b(3; 6, .5) = \binom{6}{3}(.5)^3(.5)^3 = 20(.5)^6 = .313$$

The probability that at least three prefer S is


$$P(X \ge 3) = \sum_{x=3}^{6} b(x; 6, .5) = \sum_{x=3}^{6} \binom{6}{x}(.5)^x(.5)^{6 - x} = .656$$


and the probability that at most one prefers S is 


$$P(X \le 1) = \sum_{x=0}^{1} b(x; 6, .5) = .109$$


2.4.2 Computing Binomial Probabilities 


Even for a relatively small value of n, the computation of binomial probabilities can 
be tedious. Software and statistical tables are both available for this purpose; both 
are often in terms of the cdf $F(x) = P(X \le x)$ of the distribution, either in lieu of or in
addition to the pmf. Various other probabilities can then be calculated using the 
proposition on cdfs from Sect. 2.2. 


NOTATION 
For X ~ Bin(n, p), the cdf will be denoted by 


$$B(x; n, p) = P(X \le x) = \sum_{y=0}^{x} b(y; n, p) \qquad x = 0, 1, \ldots, n$$


Table 2.2 at the end of this section provides the code for performing binomial 
calculations in both Matlab and R. In addition, Appendix Table A.1 tabulates the 
binomial cdf for n = 5, 10, 15, 20, 25 in combination with selected values of p. 


Example 2.32 Suppose that 20% of all copies of a particular textbook fail a 
binding strength test. Let X denote the number among 15 randomly selected copies 
that fail the test. Then X has a binomial distribution with n = 15 and p = .2. 

(a) The probability that at most 8 fail the test is 


$$P(X \le 8) = \sum_{y=0}^{8} b(y; 15, .2) = B(8; 15, .2)$$


This is found at the intersection of the p = .2 column and x = 8 row in the
n = 15 part of Table A.1: B(8; 15, .2) = .999. In Matlab, we may type
cdf('bin',8,15,.2); in R, pbinom(8,15,.2).
(b) The probability that exactly 8 fail is $P(X = 8) = b(8; 15, .2) = \binom{15}{8}(.2)^8(.8)^7 = .0034$.
We can evaluate this probability in Matlab or R
with the calls pdf('bin',8,15,.2) and dbinom(8,15,.2), respectively.
To use Table A.1, write


$$P(X = 8) = P(X \le 8) - P(X \le 7) = B(8; 15, .2) - B(7; 15, .2)$$


which is the difference between two consecutive entries in the p = .2 column. 
The result is .999 — .996 = .003. 

(c) The probability that at least 8 fail is $P(X \ge 8) = 1 - P(X \le 7) = 1 - B(7; 15, .2)$.
The cdf may be evaluated using Matlab or R as above, or by looking up
the entry in the x = 7 row of the p = .2 column in Table A.1. In any case, we
find $P(X \ge 8) = 1 - .996 = .004$.

(d) Finally, the probability that between 4 and 7, inclusive, fail is 


$$P(4 \le X \le 7) = P(X = 4, 5, 6, \text{ or } 7) = P(X \le 7) - P(X \le 3) = B(7; 15, .2) - B(3; 15, .2) = .996 - .648 = .348$$


Notice that this latter probability is the difference between the cdf values at
x = 7 and x = 3, not x = 7 and x = 4. ■
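
For readers without Matlab or R at hand, the four calculations of Example 2.32 can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the helper names binom_pmf and binom_cdf are our own and simply mirror b(x; n, p) and B(x; n, p).

```python
from math import comb


def binom_pmf(x, n, p):
    """b(x; n, p), zero outside 0..n."""
    return comb(n, x) * p**x * (1 - p) ** (n - x) if 0 <= x <= n else 0.0


def binom_cdf(x, n, p):
    """B(x; n, p) = P(X <= x), obtained by summing the pmf."""
    return sum(binom_pmf(y, n, p) for y in range(0, x + 1))


n, p = 15, 0.2
at_most_8 = binom_cdf(8, n, p)                              # part (a)
exactly_8 = binom_cdf(8, n, p) - binom_cdf(7, n, p)         # part (b)
at_least_8 = 1 - binom_cdf(7, n, p)                         # part (c)
between_4_and_7 = binom_cdf(7, n, p) - binom_cdf(3, n, p)   # part (d)

for label, value in [("P(X <= 8)", at_most_8), ("P(X = 8)", exactly_8),
                     ("P(X >= 8)", at_least_8), ("P(4 <= X <= 7)", between_4_and_7)]:
    print(f"{label} = {value:.4f}")
```

The results differ slightly in the last digit from those obtained with Table A.1, because the table entries are rounded to three decimal places before they are subtracted.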


Example 2.33 An electronics manufacturer claims that at most 10% of its power 
supply units need service during the warranty period. To investigate this claim, 
technicians at a testing laboratory purchase 20 units and subject each one to 
accelerated testing to simulate use during the warranty period. Let p denote the 
probability that a power supply unit needs repair during the period (i.e., the 
proportion of all such units that need repair). The laboratory technicians must
decide whether the data resulting from the experiment supports the claim that
p ≤ .10. Let X denote the number among the 20 sampled that need repair, so
X ~ Bin(20, p). Consider the decision rule


Reject the claim that p ≤ .10 in favor of the conclusion that p > .10 if x ≥ 5
(where x is the observed value of X), and consider the claim plausible if x ≤ 4


The probability that the claim is rejected when p = .10 (an incorrect conclusion) is 


$$P(X \ge 5 \text{ when } p = .10) = 1 - B(4; 20, .1) = 1 - .957 = .043$$




The probability that the claim is not rejected when p = .20 (a different type of 
incorrect conclusion) is 


$$P(X \le 4 \text{ when } p = .2) = B(4; 20, .2) = .630$$


The first probability is rather small, but the second is intolerably large. When 
p = .20, so that the manufacturer has grossly understated the percentage of units 
that need service, and the stated decision rule is used, 63% of all samples of size 
20 will result in the manufacturer’s claim being judged plausible! 

One might recognize that the probability of this second type of erroneous 
conclusion could be made smaller by changing the cutoff value 5 in the decision 
rule to something else. However, although replacing 5 by a smaller number would 
indeed yield a probability smaller than .630, the other probability would then 
increase. The only way to make both “error probabilities” small is to base the 
decision rule on an experiment involving many more units (i.e., to increase n).
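
The trade-off between the two error probabilities is easy to explore numerically. In the Python sketch below (our own illustration), the function error_probabilities and the third scenario, n = 100 with cutoff 15, are assumptions chosen for demonstration; only the first scenario comes from Example 2.33.

```python
from math import comb


def binom_cdf(x, n, p):
    """B(x; n, p) = P(X <= x)."""
    return sum(comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(0, x + 1))


def error_probabilities(n, cutoff, p_claim=0.10, p_alt=0.20):
    """Decision rule: reject the claim if X >= cutoff.

    Returns (P(reject when p = p_claim), P(do not reject when p = p_alt)).
    """
    reject_when_claim_true = 1 - binom_cdf(cutoff - 1, n, p_claim)
    accept_when_claim_false = binom_cdf(cutoff - 1, n, p_alt)
    return reject_when_claim_true, accept_when_claim_false


# (20, 5) is the rule in the example; the other two are hypothetical variations.
for n, cutoff in [(20, 5), (20, 4), (100, 15)]:
    e1, e2 = error_probabilities(n, cutoff)
    print(f"n = {n:3d}, cutoff = {cutoff:2d}: {e1:.3f}  {e2:.3f}")
```

Lowering the cutoff with n = 20 shrinks the second error probability but inflates the first, whereas the larger sample makes the second far smaller without a large increase in the first.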


2.4.3 The Mean and Variance of a Binomial Random Variable


For n = 1, the binomial distribution becomes the Bernoulli distribution. From 
Example 2.17, the mean value of a Bernoulli variable is $\mu = p$, so the expected
number of S’s on any single trial is p. Since a binomial experiment consists of 
n trials, intuition suggests that for X ~ Bin(n, p), E(X) = np, the product of the 
number of trials and the probability of success on a single trial. The expression for 
Var(X) is not so obvious. 


PROPOSITION 
If X ~ Bin(n, p), then $E(X) = np$, $\mathrm{Var}(X) = np(1 - p) = npq$, and
$SD(X) = \sqrt{npq}$ (where q = 1 − p).


Thus, calculating the mean and variance of a binomial rv does not necessitate 
evaluating summations of the sort we employed in Sect. 2.3. The proof of the result 
for E(X) is sketched in Exercise 74. 


Example 2.34 If 75% of all purchases at a store are made with a credit card and 
X is the number among ten randomly selected purchases made with a credit card, 
then X ~ Bin(10, .75). Thus $E(X) = np = (10)(.75) = 7.5$, $\mathrm{Var}(X) = np(1 - p) = 10(.75)(.25) = 1.875$,
and $\sigma = \sqrt{1.875} = 1.37$. Again, even though X can take on
only integer values, E(X) need not be an integer. If we perform a large number of
independent binomial experiments, each with n = 10 trials and p = .75, then the
average number of S’s per experiment will be close to 7.5. ■
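
The shortcut formulas can be compared with the defining summations from Sect. 2.3. The following Python sketch is our own check, with hypothetical function names. It computes the mean and variance of X ~ Bin(10, .75) directly from the pmf.

```python
from math import comb, sqrt


def binom_pmf(x, n, p):
    """b(x; n, p) for x = 0, 1, ..., n."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def mean_and_variance_by_summation(n, p):
    """E(X) and Var(X) from the definitions: sums of x*b(x) and x^2*b(x)."""
    mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
    second_moment = sum(x**2 * binom_pmf(x, n, p) for x in range(n + 1))
    return mean, second_moment - mean**2


n, p = 10, 0.75
mean, var = mean_and_variance_by_summation(n, p)
print(mean, n * p)            # both should be 7.5 (up to rounding error)
print(var, n * p * (1 - p))   # both should be 1.875 (up to rounding error)
print(sqrt(var))
```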
An important application of the binomial distribution is to estimating the
precision of simulated probabilities, as in Sect. 1.6. The relative frequency definition
of probability justified defining an estimate of a probability P(A) by
$\hat{P}(A) = X/n$, where n is the number of runs of the simulation program and
X equals the number of runs in which event A occurred. Assuming the runs of our
simulation are independent (and they usually are), the rv X has a binomial distribution
with parameters n and p = P(A). From the preceding proposition and the
rescaling properties of mean and standard deviation, we have


$$E(\hat{P}(A)) = E\left(\frac{1}{n} X\right) = \frac{1}{n} E(X) = \frac{1}{n}(np) = p = P(A)$$


Thus we expect the value of our estimate to coincide with the probability being
estimated, in the sense that there is no reason for $\hat{P}(A)$ to be systematically higher or
lower than P(A). Also,


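$$SD(\hat{P}(A)) = SD\left(\frac{1}{n} X\right) = \frac{1}{n} \cdot SD(X) = \frac{1}{n}\sqrt{np(1 - p)} = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{P(A)[1 - P(A)]}{n}} \qquad (2.14)$$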


Expression (2.14) is called the standard error of $\hat{P}(A)$ (essentially a synonym
for standard deviation) and indicates the amount by which an estimate $\hat{P}(A)$
“typically” varies from the true probability P(A). However, this expression isn’t
of much use in practice: we most often simulate a probability when P(A) is
unknown, which prevents us from using Eq. (2.14). As a solution, we simply
substitute the estimate $\hat{P} = \hat{P}(A)$ into this expression and get


$$SD(\hat{P}(A)) \approx \sqrt{\frac{\hat{P}(1 - \hat{P})}{n}}$$

This is the estimated standard error formula (1.8) given in Sect. 1.6. Very importantly,
this estimated standard error gets closer to 0 as the number of runs, n, in the
simulation increases.
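
To see the estimated standard error in action, consider the following minimal Python sketch. The simulated event (at least one six in four rolls of a fair die) and the function names are our own hypothetical choices, made only because the true probability, 1 − (5/6)^4 ≈ .518, is known and can be compared with the estimate.

```python
import random
from math import sqrt


def estimate_probability(event_occurs, n_runs, seed=0):
    """Return (p_hat, estimated standard error) from n_runs independent runs."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    x = sum(1 for _ in range(n_runs) if event_occurs(rng))  # X = runs where A occurred
    p_hat = x / n_runs
    se_hat = sqrt(p_hat * (1 - p_hat) / n_runs)
    return p_hat, se_hat


def at_least_one_six_in_four_rolls(rng):
    """Hypothetical event A used only for illustration."""
    return any(rng.randint(1, 6) == 6 for _ in range(4))


for n_runs in (1_000, 10_000, 40_000):
    p_hat, se_hat = estimate_probability(at_least_one_six_in_four_rolls, n_runs)
    print(n_runs, round(p_hat, 4), round(se_hat, 4))
```

Because the standard error is proportional to $1/\sqrt{n}$, quadrupling the number of runs roughly halves it.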


2.4.4 Binomial Calculations with Software 


Many software packages, including Matlab and R, have built-in functions to 
evaluate both the pmf and cdf of the binomial distribution (and many other 
named distributions). Table 2.2 summarizes the relevant code in both packages. 
The use of these functions was illustrated in Example 2.32. 
Table 2.2 Binomial probability calculations in Matlab and R

Function:   pmf                    cdf
Notation:   b(x; n, p)             B(x; n, p)
Matlab:     pdf('bin', x, n, p)    cdf('bin', x, n, p)
R:          dbinom(x, n, p)        pbinom(x, n, p)


2.4.5 Exercises: Section 2.4 (49-74) 


49. Determine whether each of the following rvs has a binomial distribution. If it
does, identify the values of the parameters n and p (if possible).

(a) X = the number of Es in 10 rolls of a fair die 

(b) X = the number of multiple-choice questions a student gets right on a 
40-question test, when each question has four choices and the student is 
completely guessing 

(c) X = the same as (b), but half the questions have four choices and the other 
half have three 

(d) X = the number of women in a random sample of 8 students, from a class 
comprising 20 women and 15 men 

(e) X = the total weight of 15 randomly selected apples 

(f) X = the number of apples, out of a random sample of 15, that weigh more 
than 150 g 

50. Compute the following binomial probabilities directly from the formula for
b(x; n, p):

(a) b(3; 8, .6)

(b) b(5; 8, .6)

(c) P(3 < X < 5) when n = 8 and p = .6

(d) P(1 < X) when n = 12 and p = .1

51. Use Appendix Table A.1 or software to obtain the following probabilities:

(a) B(4; 10, .3) 

(b) b(4; 10, .3) 

(c) b(6; 10, .7) 

(d) P(2 < X < 4) when X ~ Bin(10, .3) 

(e) P(2 < X) when X ~ Bin(10, .3) 

(f) P(X < 1) when X ~ Bin(10, .7) 

(g) P(2 <X <6) when X ~ Bin(10, .3) 

52. When circuit boards used in the manufacture of DVD players are tested, the
long-run percentage of defectives is 5%. Let X = the number of defective
boards in a random sample of size n = 25, so X ~ Bin(25, .05).

(a) Determine P(X < 2). 

(b) Determine P(X > 5). 

(c) Determine P(1 < X < 4). 

(d) What is the probability that none of the 25 boards is defective? 

(e) Calculate the expected value and standard deviation of X. 
53. A company that produces fine crystal knows from experience that 10% of its
goblets have cosmetic flaws and must be classified as “seconds.”

(a) Among six randomly selected goblets, how likely is it that only one is a 
second? 

(b) Among six randomly selected goblets, what is the probability that at least 
two are seconds? 

(c) If goblets are examined one by one, what is the probability that at most 
five must be selected to find four that are not seconds? 

54. Suppose that only 25% of all drivers come to a complete stop at an intersection
having flashing red lights in all directions when no other cars are visible.
What is the probability that, of 20 randomly chosen drivers coming to an
intersection under these conditions,

(a) At most 6 will come to a complete stop? 

(b) Exactly 6 will come to a complete stop? 

(c) At least 6 will come to a complete stop? 

55. Refer to the previous exercise.

(a) What is the expected number of drivers among the 20 that come to a 
complete stop? 

(b) What is the standard deviation of the number of drivers among the 20 that 
come to a complete stop? 

(c) What is the probability that the number of drivers among these 20 that 
come to a complete stop differs from the expected number by more than 
2 standard deviations? 

56. Suppose that 30% of all students who have to buy a text for a particular course
want a new copy (the successes!), whereas the other 70% want a used copy.
Consider randomly selecting 25 purchasers.

(a) What are the mean value and standard deviation of the number who want a 
new copy of the book? 

(b) What is the probability that the number who want new copies is more than 
two standard deviations away from the mean value? 

(c) The bookstore has 15 new copies and 15 used copies in stock. If 25 people 
come in one by one to purchase this text, what is the probability that all 
25 will get the type of book they want from current stock? [Hint: Let X = 
the number who want a new copy. For what values of X will all 25 get 
what they want?] 

(d) Suppose that new copies cost $100 and used copies cost $70. Assume the 
bookstore has 50 new copies and 50 used copies. What is the expected 
value of total revenue from the sale of the next 25 copies purchased? 
[Hint: Let h(X) = the revenue when X of the 25 purchasers want new 
copies. Express this as a linear function.] 

57. Exercise 30 (Sect. 2.3) gave the pmf of Y, the number of traffic citations for a
randomly selected individual insured by a company. What is the probability
that among 15 randomly chosen such individuals

(a) At least 10 have no citations? 

(b) Fewer than half have at least one citation? 

(c) The number that have at least one citation is between 5 and 10, inclusive? 
58. A particular type of tennis racket comes in a midsize version and an oversize
version. Sixty percent of all customers at a store want the oversize version.

(a) Among ten randomly selected customers who want this type of racket, what 
is the probability that at least six want the oversize version? 

(b) Among ten randomly selected customers, what is the probability that the 
number who want the oversize version is within 1 standard deviation of the
mean value? 

(c) The store currently has seven rackets of each version. What is the proba- 
bility that all of the next ten customers who want this racket can get the 
version they want from current stock? 

59. Twenty percent of all telephones of a certain type are submitted for service
while under warranty. Of these, 60% can be repaired, whereas the other 40%
must be replaced with new units. If a company purchases ten of these
telephones, what is the probability that exactly two will end up being replaced
under warranty?

60. The College Board reports that 2% of the two million high school students who
take the SAT each year receive special accommodations because of
documented disabilities (Los Angeles Times, July 16, 2002). Consider a random
sample of 25 students who have recently taken the test.

(a) What is the probability that exactly 1 received a special accommodation? 

(b) What is the probability that at least 1 received a special accommodation? 

(c) What is the probability that at least 2 received a special accommodation? 

(d) What is the probability that the number among the 25 who received a 
special accommodation is within 2 standard deviations of the number you 
would expect to be accommodated? 

(e) Suppose that a student who does not receive a special accommodation is 
allowed 3 hours for the exam, whereas an accommodated student is 
allowed 4.5 hours. What would you expect the average time allowed the 
25 selected students to be? 

61. Suppose that 90% of all batteries from a supplier have acceptable voltages. A
certain type of flashlight requires two type-D batteries, and the flashlight will
work only if both its batteries have acceptable voltages. Among ten randomly
selected flashlights, what is the probability that at least nine will work? What
assumptions did you make in the course of answering the question posed?

62. A k-out-of-n system functions provided that at least k of the n components
function. Consider independently operating components, each of which
functions (for the needed duration) with probability .96.

(a) In a 3-component system, what is the probability that exactly two 
components function? 

(b) What is the probability a 2-out-of-3 system works? 

(c) What is the probability a 3-out-of-5 system works? 

(d) What is the probability a 4-out-of-5 system works? 

(e) What does the component probability (previously .96) need to equal so that 
the 4-out-of-5 system will function with probability at least .9999? 


63. Bit transmission errors between computers sometimes occur, where one computer
sends a 0 but the other computer receives a 1 (or vice versa). Because of
this, the computer sending a message repeats each bit three times, so a 0 is sent
as 000 and a 1 as 111. The receiving computer “decodes” each triplet by
majority rule: whichever number, 0 or 1, appears more often in a triplet is
declared to be the intended bit. For example, both 000 and 100 are decoded as
0, while 101 and 011 are decoded as 1. Suppose that 6% of bits are switched
(0 to 1, or 1 to 0) during transmission between two particular computers, and
that these errors occur independently during transmission.

(a) Find the probability that a triplet is decoded incorrectly by the receiving 
computer. 

(b) Using your answer to part (a), explain how using triplets reduces
communication errors.

(c) How does your answer to part (a) change if each bit is repeated five times 
(instead of three)? 

(d) Imagine a 25 kilobit message (i.e., one requiring 25,000 bits to send). What 
is the expected number of errors if there is no bit repetition implemented? If 
each bit is repeated three times? 

64. A very large batch of components has arrived at a distributor. The batch can be
characterized as acceptable only if the proportion of defective components is at
most .10. The distributor decides to randomly select 10 components and to accept
the batch only if the number of defective components in the sample is at most 2.

(a) What is the probability that the batch will be accepted when the actual 
proportion of defectives is .01? .05? .10? .20? .25? 

(b) Let p denote the actual proportion of defectives in the batch. A graph of 
P(batch is accepted) as a function of p, with p on the horizontal axis and 
P(batch is accepted) on the vertical axis, is called the operating characteristic
curve for the acceptance sampling plan. Use the results of part (a) to
sketch this curve for 0 < p < 1. 

(c) Repeat parts (a) and (b) with “1” replacing “2” in the acceptance 
sampling plan. 

(d) Repeat parts (a) and (b) with “15” replacing “10” in the acceptance 
sampling plan. 

(e) Which of the three sampling plans, that of part (a), (c), or (d), appears most 
satisfactory, and why? 

65. An ordinance requiring that a smoke detector be installed in all previously
constructed houses has been in effect in a city for 1 year. The fire department is
concerned that many houses remain without detectors. Let p = the true
proportion of such houses having detectors, and suppose that a random sample
of 25 homes is inspected. If the sample strongly indicates that fewer than 80%
of all houses have a detector, the fire department will campaign for a mandatory
inspection program. Because of the costliness of the program, the department
prefers not to call for such inspections unless sample evidence strongly argues
for their necessity. Let X denote the number of homes with detectors among the
25 sampled. Consider rejecting the claim that p > .8 if X < 15.
(a) What is the probability that the claim is rejected when the actual value of
p is .8? 

(b) What is the probability of not rejecting the claim when p = .7? When p = .6? 

(c) How do the “error probabilities” of parts (a) and (b) change if the value 
15 in the decision rule is replaced by 14? 

66. A toll bridge charges $1.00 for passenger cars and $2.50 for other vehicles.
Suppose that during daytime hours, 60% of all vehicles are passenger cars. If
25 vehicles cross the bridge during a particular daytime period, what is the
resulting expected toll revenue? [Hint: Let X = the number of passenger cars;
then the toll revenue h(X) is a linear function of X.]

67. A student who is trying to write a paper for a course has a choice of two topics,
A and B. If topic A is chosen, the student will order two books through
interlibrary loan, whereas if topic B is chosen, the student will order four
books. The student believes that a good paper necessitates receiving and
using at least half the books ordered for either topic chosen. If the probability
that a book ordered through interlibrary loan actually arrives in time is .9 and
books arrive independently of one another, which topic should the student
choose to maximize the probability of writing a good paper? What if the arrival
probability is only .5 instead of .9?

68. Twelve jurors are randomly selected from a large population. Each juror arrives
at her or his conclusion about the case before the jury independently of the
other jurors.

(a) In a criminal case, all 12 jurors must agree on a verdict. Let p denote the 
probability that a randomly selected member of the population would reach 
a guilty verdict based on the evidence presented (so a proportion 1 — p 
would reach “not guilty”). What is the probability, in terms of p, that the 
jury reaches a unanimous verdict one way or the other? 

(b) For what values of p is the probability in part (a) the highest? For what 
value of p is the probability in (a) the lowest? Explain why this makes 
sense. 

(c) In most civil cases, only a nine-person majority is required to decide a 
verdict. That is, if nine or more jurors favor the plaintiff, then the plaintiff 
wins; if at least nine jurors side with the defendant, then the defendant wins. 
Let p denote the probability that someone would side with the plaintiff 
based on the evidence. What is the probability, in terms of p, that the jury 
reaches a verdict one way or the other? How does this compare with your 
answer to part (a)? 

69. Customers at a gas station pay with a credit card (A), debit card (B), or cash (C).
Assume that successive customers make independent choices, with P(A) = .5,
P(B) = .2, and P(C) = .3.

(a) Among the next 100 customers, what are the mean and variance of the 
number who pay with a debit card? Explain your reasoning. 

(b) Answer part (a) for the number among the 100 who don’t pay with cash. 

70. An airport limousine can accommodate up to four passengers on any one trip.
The company will accept a maximum of six reservations for a trip, and a
passenger must have a reservation. From previous records, 20% of all those
making reservations do not appear for the trip. In the following questions,
assume independence, but explain why there could be dependence.

(a) If six reservations are made, what is the probability that at least one 
individual with a reservation cannot be accommodated on the trip? 

(b) If six reservations are made, what is the expected number of available 
places when the limousine departs? 

(c) Suppose the probability distribution of the number of reservations made is 
given in the accompanying table. 


Number of reservations | 3 4 5 6 
Probability | 1 


Let X denote the number of passengers on a randomly selected trip. Obtain 
the probability mass function of X. 

71. Let X be a binomial random variable with fixed n.

(a) Are there values of p (0 < p < 1) for which Var(X) = 0? Explain why this
is so.

(b) For what value of p is Var(X) maximized? [Hint: Either graph Var(X) as a 
function of p or else take a derivative.] 

72. (a) Show that b(x; n, 1 − p) = b(n − x; n, p).

(b) Show that B(x; n, 1 − p) = 1 − B(n − x − 1; n, p). [Hint: At most x S’s is
equivalent to at least (n − x) F’s.]

(c) What do parts (a) and (b) imply about the necessity of including values of 
p greater than .5 in Table A.1? 

73. Refer to Chebyshev’s inequality given in Sect. 2.3. Calculate $P(|X - \mu| \ge k\sigma)$
for k = 2 and k = 3 when X ~ Bin(20, .5), and compare to the corresponding
upper bounds. Repeat for X ~ Bin(20, .75).

74. Show that E(X) = np when X is a binomial random variable. [Hint: Express
E(X) as a sum with lower limit x = 1. Then factor out np, let y = x − 1 so
that the sum is from y = 0 to y = n − 1, and show that the sum equals 1.]


2.5 The Poisson Distribution


The binomial distribution was derived by starting with an experiment consisting of 
trials and applying the laws of probability to various outcomes of the experiment. 
There is no simple experiment on which the Poisson distribution is based, although 
we will shortly describe how it can be obtained from the binomial distribution by 
certain limiting operations. 
DEFINITION
A random variable X is said to have a Poisson distribution with parameter
$\mu$ ($\mu > 0$) if the pmf of X is


$$p(x; \mu) = \frac{e^{-\mu} \mu^x}{x!} \qquad x = 0, 1, 2, \ldots$$


We shall see shortly that $\mu$ is in fact the expected value of X, so the notation
here is consistent with our previous use of the symbol $\mu$. Because $\mu$ must be
positive, $p(x; \mu) > 0$ for all possible x values. The fact that $\sum_{x=0}^{\infty} p(x; \mu) = 1$ is a
consequence of the Maclaurin infinite series expansion of $e^{\mu}$, which appears in most
calculus texts:


$$e^{\mu} = 1 + \mu + \frac{\mu^2}{2!} + \frac{\mu^3}{3!} + \cdots = \sum_{x=0}^{\infty} \frac{\mu^x}{x!} \qquad (2.15)$$


If the two extreme terms in Eq. (2.15) are multiplied by $e^{-\mu}$ and then $e^{-\mu}$ is
placed inside the summation, the result is


$$1 = \sum_{x=0}^{\infty} \frac{e^{-\mu} \mu^x}{x!}$$


which shows that $p(x; \mu)$ fulfills the second condition necessary for specifying
a pmf.
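
The normalization can also be observed numerically: partial sums of the Poisson pmf approach 1 very quickly. The Python sketch below is our own illustration; the value μ = 4.5 and the helper name poisson_pmf are assumptions made for the demonstration.

```python
from math import exp, factorial


def poisson_pmf(x, mu):
    """p(x; mu) = e^(-mu) * mu^x / x!  for x = 0, 1, 2, ..."""
    return exp(-mu) * mu**x / factorial(x)


mu = 4.5  # illustrative parameter value
partial_sum = 0.0
for x in range(31):
    partial_sum += poisson_pmf(x, mu)
    if x % 5 == 0:
        print(x, partial_sum)  # partial sums of the pmf, approaching 1
```

The infinite sum cannot be evaluated term by term, but the terms decay so fast (because of the x! in the denominator) that a few dozen terms already account for essentially all of the probability.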


Example 2.35 Let X denote the number of creatures of a particular type captured
in a trap during a given time period. Suppose that X has a Poisson distribution with
$\mu = 4.5$, so on average traps will contain 4.5 creatures. [The article “Dispersal
Dynamics of the Bivalve Gemma gemma in a Patchy Environment” (Ecol.
Monogr., 1995: 1-20) suggests this model; the bivalve Gemma gemma is a small
clam.] The probability that a trap contains exactly five creatures is


$$P(X = 5) = \frac{e^{-4.5}(4.5)^5}{5!} = .1708$$


The probability that a trap contains at most five creatures is


$$P(X \le 5) = \sum_{x=0}^{5} \frac{e^{-4.5}(4.5)^x}{x!} = .7029 \qquad \blacksquare$$
2.5.1 The Poisson Distribution as a Limit


The rationale for using the Poisson distribution in many situations is provided by 
the following proposition. 


PROPOSITION
Suppose that in the binomial pmf b(x; n, p) we let $n \to \infty$ and $p \to 0$ in such a
way that np approaches a value $\mu > 0$. Then $b(x; n, p) \to p(x; \mu)$.


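Proof Begin with the binomial pmf:


$$b(x; n, p) = \binom{n}{x} p^x (1 - p)^{n - x} = \frac{n!}{x!(n - x)!} p^x (1 - p)^{n - x} = \frac{n(n - 1) \cdots (n - x + 1)}{x!} p^x (1 - p)^{n - x}$$


Now multiply both the numerator and denominator by $n^x$:


$$b(x; n, p) = \frac{n}{n} \cdot \frac{n - 1}{n} \cdots \frac{n - x + 1}{n} \cdot \frac{(np)^x}{x!} \cdot \frac{(1 - p)^n}{(1 - p)^x}$$


Taking the limit as $n \to \infty$ and $p \to 0$ with $np \to \mu$,


$$\lim_{n \to \infty} b(x; n, p) = 1 \cdot 1 \cdots 1 \cdot \frac{\mu^x}{x!} \cdot \frac{\lim_{n \to \infty} (1 - np/n)^n}{1}$$


The limit on the right can be obtained from the calculus theorem that says the
limit of $(1 - a_n/n)^n$ is $e^{-a}$ if $a_n \to a$. Because $np \to \mu$,


$$\lim_{n \to \infty} b(x; n, p) = \frac{\mu^x}{x!} \cdot \lim_{n \to \infty} \left(1 - \frac{np}{n}\right)^n = \frac{\mu^x e^{-\mu}}{x!} = p(x; \mu) \qquad \blacksquare$$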


According to the proposition, in any binomial experiment for which n is large 
and p is small, $b(x; n, p) \approx p(x; \mu)$ where $\mu = np$. It is interesting to note that Siméon
Poisson discovered the distribution that bears his name by this approach in the 
1830s. 

Table 2.3 shows the Poisson distribution for $\mu = 3$ along with three binomial
distributions with np = 3, and Fig. 2.8 (from R) plots the Poisson along with the first 
two binomial distributions. The approximation is of limited use for n = 30, but of 
course the accuracy is better for n = 100 and much better for n = 300. 
Table 2.3 Comparing the Poisson and three binomial distributions

x    n = 30, p = .1   n = 100, p = .03   n = 300, p = .01   Poisson, μ = 3
0    0.042391         0.047553           0.049041           0.049787
1    0.141304         0.147070           0.148609           0.149361
2    0.227656         0.225153           0.224414           0.224042
3    0.236088         0.227474           0.225170           0.224042
4    0.177066         0.170606           0.168877           0.168031
5    0.102305         0.101308           0.100985           0.100819
6    0.047363         0.049610           0.050153           0.050409
7    0.018043         0.020604           0.021277           0.021604
8    0.005764         0.007408           0.007871           0.008102
9    0.001565         0.002342           0.002580           0.002701
10   0.000365         0.000659           0.000758           0.000810
Fig. 2.8 Comparing a Poisson and two binomial distributions (vertical axis: P(x))
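
The entries of Table 2.3 can be regenerated, and the quality of the approximation quantified, with a short Python sketch. The helper names and the choice of maximum absolute difference over x = 0, ..., 10 as the accuracy measure are our own assumptions, not something the text prescribes.

```python
from math import comb, exp, factorial


def binom_pmf(x, n, p):
    """b(x; n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, mu):
    """p(x; mu)."""
    return exp(-mu) * mu**x / factorial(x)


mu = 3
sample_sizes = (30, 100, 300)  # each paired with p = mu / n, so that np = 3

for x in range(11):
    row = [binom_pmf(x, n, mu / n) for n in sample_sizes] + [poisson_pmf(x, mu)]
    print(f"{x:2d}  " + "  ".join(f"{v:.6f}" for v in row))

# Largest pmf discrepancy over x = 0..10 for each n
max_diff = {
    n: max(abs(binom_pmf(x, n, mu / n) - poisson_pmf(x, mu)) for x in range(11))
    for n in sample_sizes
}
print(max_diff)
```

The maximum discrepancy shrinks as n grows with np held at 3, in line with the limiting proposition.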


Example 2.36 Suppose you have a 4-megabit modem (4,000,000 bits/s) with bit
error probability $10^{-8}$. Assume bit errors occur independently, and assume your bit
rate stays constant at 4 Mbps. What is the probability of exactly 3 bit errors in the
next minute? Of at most 3 bit errors in the next minute?

Define a random variable X = the number of bit errors in the next minute. From
the description, X satisfies the conditions of a binomial distribution; specifically,
since a constant bit rate of 4 Mbps equates to 240,000,000 bits transmitted per
minute, X ~ Bin(240000000, $10^{-8}$). Hence, the probability of exactly three bit
errors in the next minute is


$$P(X = 3) = b(3; 240000000, 10^{-8}) = \binom{240000000}{3}(10^{-8})^3(1 - 10^{-8})^{239999997}$$
For a variety of reasons, some calculators will struggle with this computation.
The expression for the chance of at most 3 bit errors, $P(X \le 3)$, is even worse. (The
inability to compute such expressions in the nineteenth century, even with modest
values of n and p, was Poisson’s motive to derive an easily computed
approximation.)

We may approximate these probabilities using the Poisson distribution with
$\mu = np = 240000000(10^{-8}) = 2.4$. Then


$$P(X = 3) \approx p(3; 2.4) = \frac{e^{-2.4}(2.4)^3}{3!} = .20901416$$

Similarly, the probability of at most 3 bit errors in the next minute is
approximated by


$$P(X \le 3) \approx \sum_{x=0}^{3} p(x; 2.4) = \sum_{x=0}^{3} \frac{e^{-2.4}(2.4)^x}{x!} = .77872291$$


Using modern software, the exact probabilities (i.e., using the binomial model)
are .2090141655 and .7787229106, respectively. The Poisson approximations agree
to eight decimal places and are clearly more computationally tractable. ■
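
One way such “exact” binomial values might be obtained in software is to work on the logarithmic scale. The Python sketch below is our own approach, not necessarily what any particular package does internally. It uses math.log1p so that $(1 - p)^{n - x}$ is evaluated accurately when p is tiny; the function names are hypothetical.

```python
from math import comb, exp, factorial, log, log1p


def binom_pmf_stable(x, n, p):
    """b(x; n, p) computed as exp(log C(n,x) + x log p + (n - x) log(1 - p))."""
    log_prob = log(comb(n, x)) + x * log(p) + (n - x) * log1p(-p)
    return exp(log_prob)


def poisson_pmf(x, mu):
    """p(x; mu)."""
    return exp(-mu) * mu**x / factorial(x)


n, p = 240_000_000, 1e-8
mu = n * p  # 2.4, up to floating-point rounding

exact_3 = binom_pmf_stable(3, n, p)
exact_at_most_3 = sum(binom_pmf_stable(x, n, p) for x in range(4))
approx_3 = poisson_pmf(3, mu)
approx_at_most_3 = sum(poisson_pmf(x, mu) for x in range(4))

print(f"{exact_3:.10f}  {approx_3:.10f}")
print(f"{exact_at_most_3:.10f}  {approx_at_most_3:.10f}")
```

Raising $1 - 10^{-8}$ directly to a power of about 240 million would amplify the rounding error in that base, which is why the logarithm of $1 - p$ is computed with log1p instead.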


Many software packages will compute both $p(x; \mu)$ and the corresponding cdf
$P(x; \mu)$ for specified values of x and $\mu$ upon request; the relevant Matlab and R
functions appear in Table 2.4 at the end of this section. Appendix Table A.2 exhibits
the cdf $P(x; \mu)$ for $\mu$ = .1, .2, ..., 1, 2, ..., 10, 15, and 20. For example, if $\mu = 2$, then
$P(X \le 3) = P(3; 2) = .857$, whereas $P(X = 3) = P(3; 2) - P(2; 2) = .180$.


2.5.2 The Mean and Variance of a Poisson Random Variable


Since $b(x; n, p) \to p(x; \mu)$ as $n \to \infty$, $p \to 0$, $np \to \mu$, one might guess that the mean
and variance of a binomial variable approach those of a Poisson variable. These
limits are $np \to \mu$ and $np(1 - p) \to \mu$.


PROPOSITION 
If X has a Poisson distribution with parameter $\mu$, then $E(X) = \mathrm{Var}(X) = \mu$.


These results can also be derived directly from the definitions of mean and 
variance (see Exercise 88 for the mean). 


Example 2.37 (Example 2.35 continued) Both the expected number of creatures
trapped and the variance of the number trapped equal 4.5, and


$$\sigma_X = \sqrt{\mu} = \sqrt{4.5} = 2.12 \qquad \blacksquare$$
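
The proposition can be checked numerically by truncating the infinite sums that define the mean and variance. In the Python sketch below (our own illustration), the cutoff of 100 terms is an assumption; it is far more than enough for μ = 4.5 because the pmf is negligible well before x = 100.

```python
from math import exp, factorial, sqrt


def poisson_pmf(x, mu):
    """p(x; mu)."""
    return exp(-mu) * mu**x / factorial(x)


def poisson_moments(mu, terms=100):
    """Approximate E(X) and Var(X) by summing the first `terms` values of x."""
    mean = sum(x * poisson_pmf(x, mu) for x in range(terms))
    second_moment = sum(x * x * poisson_pmf(x, mu) for x in range(terms))
    return mean, second_moment - mean**2


mean, var = poisson_moments(4.5)
print(mean, var, sqrt(var))
```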
2.5.3 The Poisson Process


A very important application of the Poisson distribution arises in connection with
the occurrence of events of a particular type over time. As an example, suppose that
starting from a time point that we label t = 0, we are interested in counting the
number of radioactive pulses recorded by a Geiger counter. If we make certain
assumptions² about the way in which pulses occur (chiefly, that the number of
pulses grows roughly linearly with time), then it can be shown that the number of
pulses in any time interval of length t can be modeled by a Poisson distribution with
mean $\mu = \lambda t$ for an appropriate positive constant $\lambda$. Since the expected number of
pulses in an interval of length t is $\lambda t$, the expected number in an interval of length
1 is $\lambda$. Thus $\lambda$ is the long run number of pulses per unit of time.

If we replace “pulse” by “event,” then the number of events occurring during a 
fixed time interval of length t has a Poisson distribution with parameter $\lambda t$. Any
process that has this distribution is called a Poisson process, and $\lambda$ is called the rate
of the process. Other examples of situations giving rise to a Poisson process include 
monitoring the status of a computer system over time, with breakdowns constituting 
the events of interest; recording the number of accidents in an industrial facility 
over time; answering 911 calls at a particular location; and observing the number of 
cosmic-ray showers from an observatory. 

Example 2.36 hints at why this might be reasonable: if we “digitize” time—that 
is, divide time into discrete pieces, such as transmitted bits—and look at the number 
of the resulting time pieces that include an event, a binomial model is often 
applicable. If the number of time pieces is very large and the success probability 
close to zero, which would occur if we divided a fixed time frame into ever-smaller 
pieces, then we may invoke the Poisson approximation from earlier in this section. 
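
The limiting argument in the preceding paragraph can be checked numerically. The following Python sketch is illustrative only: the function names and the values μ = 3 and x = 2 are assumptions, not taken from the text. It splits a fixed time frame into n pieces, each containing an event with probability μ/n, and compares the resulting binomial probability with the Poisson probability as n grows.

```python
import math

def binom_pmf(x, n, p):
    """b(x; n, p): probability of x successes in n independent trials."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, mu):
    """p(x; mu) = e^(-mu) * mu^x / x!"""
    return math.exp(-mu) * mu**x / math.factorial(x)

mu = 3.0   # assumed expected number of events in the fixed time frame
x = 2      # assumed count whose probability we track

# Divide the time frame into n pieces; each piece holds an event w.p. mu/n.
for n in (10, 100, 1000, 10000):
    p = mu / n
    print(n, round(binom_pmf(x, n, p), 5), round(poisson_pmf(x, mu), 5))
```

As n increases, the binomial column should settle toward the Poisson value, which is the sense in which the Poisson approximation applies to ever-smaller time pieces.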


Example 2.38 Suppose pulses arrive at the Geiger counter at an average rate of
6 per minute, so that λ = 6. To find the probability that in a 30-s interval at least one
pulse is received, note that the number of pulses in such an interval has a Poisson
distribution with parameter λt = 6(.5) = 3 (.5 min is used because λ is expressed as a
rate per minute). Then with X = the number of pulses received in the 30-s interval,


$P(X \ge 1) = 1 - P(X = 0) = 1 - \dfrac{e^{-3}(3)^0}{0!} = .950$


In a 1-h interval (t = 60), the expected number of pulses is μ = λt = 6(60) =
360, with a standard deviation of σ = √μ = √360 = 18.97. According to this
model, in a typical hour we will observe 360 ± 19 pulses arrive at the Geiger
counter. ■
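
The arithmetic of Example 2.38 can be reproduced with a few lines of Python. This is a minimal sketch, with helper names (poisson_pmf, poisson_cdf) chosen here for illustration; they play the role of the pmf and cdf commands listed in Table 2.4.

```python
import math

def poisson_pmf(x, mu):
    """p(x; mu) = e^(-mu) * mu^x / x!"""
    return math.exp(-mu) * mu**x / math.factorial(x)

def poisson_cdf(x, mu):
    """P(X <= x) for a Poisson rv with parameter mu."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))

rate = 6.0                  # lambda: pulses per minute
mu_30s = rate * 0.5         # 30 s = 0.5 min, so mu = lambda * t = 3
p_at_least_one = 1 - poisson_cdf(0, mu_30s)

mu_hour = rate * 60         # expected pulses in a 1-h interval
sd_hour = math.sqrt(mu_hour)  # for a Poisson rv, variance = mean

print(round(p_at_least_one, 3))     # 0.95
print(mu_hour, round(sd_hour, 2))   # 360.0 18.97
```

The only modeling step is converting the interval length to the time unit in which the rate is expressed before forming μ = λt.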


² In Sect. 7.5, we present the formal assumptions required in this situation and derive the Poisson
distribution that results from these assumptions.
Instead of observing events over time, consider observing events of some type
that occur in a two- or three-dimensional region. For example, we might select on a
map a certain region R of a forest, go to that region, and count the number of trees.
Each tree would represent an event occurring at a particular point in space. Under
appropriate assumptions (see Sect. 7.5), it can be shown that the number of events
occurring in a region R has a Poisson distribution with parameter λ · a(R), where
a(R) is the area of R. The quantity λ is the expected number of events per unit area
or volume.


2.5.4 Poisson Calculations with Software 


Table 2.4 gives the Matlab and R commands for calculating Poisson probabilities. 


Table 2.4 Poisson probability calculations


Function:    pmf                  cdf
Notation:    p(x; μ)              P(x; μ)
Matlab:      pdf('pois',x,μ)      cdf('pois',x,μ)
R:           dpois(x,μ)           ppois(x,μ)


2.5.5 Exercises: Section 2.5 (75-89) 


75. Let X, the number of flaws on the surface of a randomly selected carpet of a
particular type, have a Poisson distribution with parameter μ = 5. Use
software or Appendix Table A.2 to compute the following probabilities:

(a) P(X ≤ 8)

(b) P(X = 8)

(c) P(9 ≤ X)

(d) P(5 ≤ X ≤ 8)

(e) P(5 < X < 8)

76. Let X be the number of material anomalies occurring in a particular region of
an aircraft gas-turbine disk. The article “Methodology for Probabilistic Life
Prediction of Multiple-Anomaly Materials” (Amer. Inst. of Aeronautics and
Astronautics J., 2006: 787-793) proposes a Poisson distribution for X.
Suppose μ = 4.

(a) Compute both P(X ≤ 4) and P(X < 4).

(b) Compute P(4 ≤ X ≤ 8).

(c) Compute P(8 ≤ X).

(d) What is the probability that the observed number of anomalies exceeds the
expected number by no more than one standard deviation?

77. Suppose that the number of drivers who travel between a particular origin and
destination during a designated time period has a Poisson distribution with
parameter μ = 20 (suggested in the article “Dynamic Ride Sharing: Theory
and Practice,” J. of Transp. Engr., 1997: 308-312). What is the probability
that the number of drivers will

(a) Be at most 10?

(b) Exceed 20?

(c) Be between 10 and 20, inclusive? Be strictly between 10 and 20?

(d) Be within 2 standard deviations of the mean value?

78. Consider writing onto a computer disk and then sending it through a certifier
that counts the number of missing pulses. Suppose this number X has a
Poisson distribution with parameter μ = .2. (Suggested in “Average Sample
Number for Semi-Curtailed Sampling Using the Poisson Distribution,”
J. Qual. Tech., 1983: 126-129.)

(a) What is the probability that a disk has exactly one missing pulse?

(b) What is the probability that a disk has at least two missing pulses?

(c) If two disks are independently selected, what is the probability that neither
contains a missing pulse?

79. An article in the Los Angeles Times (Dec. 3, 1993) reports that 1 in 200 people
carry the defective gene that causes inherited colon cancer. In a sample of 1000
individuals, what is the approximate distribution of the number who carry this
gene? Use this distribution to calculate the approximate probability that

(a) Between 5 and 8 (inclusive) carry the gene.

(b) At least 8 carry the gene.

80. Suppose that only .10% of all computers of a certain type experience CPU
failure during the warranty period. Consider a sample of 10,000 computers.

(a) What are the expected value and standard deviation of the number of
computers in the sample that have the defect?

(b) What is the (approximate) probability that more than 10 sampled
computers have the defect?

(c) What is the (approximate) probability that no sampled computers have the
defect?

81. If a publisher of nontechnical books takes great pains to ensure that its books
are free of typographical errors, so that the probability of any given page
containing at least one such error is .005 and errors are independent from page
to page, what is the probability that one of its 400-page novels will contain
exactly one page with errors? At most three pages with errors?

82. In proof testing of circuit boards, the probability that any particular diode will
fail is .01. Suppose a circuit board contains 200 diodes.

(a) How many diodes would you expect to fail, and what is the standard
deviation of the number that are expected to fail?

(b) What is the (approximate) probability that at least four diodes will fail on
a randomly selected board?

(c) If five boards are shipped to a particular customer, how likely is it that at
least four of them will work properly? (A board works properly only if all
its diodes work.)

83. The article “Expectation Analysis of the Probability of Failure for Water Supply
Pipes” (J. Pipeline Syst. Eng. Pract. 2012.3:36-46) recommends using a
Poisson process to model the number of failures in commercial water pipes. The
article also gives estimates of the failure rate λ, in units of failures per 100 miles
of pipe per day, for four different types of pipe and for many different years.

(a) For PVC pipe in 2008, the authors estimate a failure rate of 0.0081 failures
per 100 miles of pipe per day. Consider a 100-mile-long segment of such
pipe. What is the expected number of failures in 1 year (365 days)? Based
on this expectation, what is the probability of at least one failure along
such a pipe in 1 year?

(b) For cast iron pipe in 2005, the authors’ estimate is λ = 0.0864 failures per
100 miles per day. Suppose a town had 1500 miles of cast iron pipe
underground in 2005. What is the probability of at least one failure
somewhere along this pipe system on any given day?

84. Organisms are present in ballast water discharged from a ship according to a
Poisson process with a concentration of 10 organisms/m³ (the article “Counting
at Low Concentrations: The Statistical Challenges of Verifying Ballast Water
Discharge Standards” (Ecological Applications, 2013: 339-351) considers
using the Poisson process for this purpose).

(a) What is the probability that one cubic meter of discharge contains at least
8 organisms?

(b) What is the probability that the number of organisms in 1.5 m³ of discharge
exceeds its mean value by more than one standard deviation?

(c) For what amount of discharge would the probability of containing at least
one organism be .999?

85. Suppose small aircraft arrive at an airport according to a Poisson process with
rate λ = 8 per hour, so that the number of arrivals during a time period of t hours
is a Poisson rv with parameter μ = 8t.

(a) What is the probability that exactly 6 small aircraft arrive during a 1-h
period? At least 6? At least 10?

(b) What are the expected value and standard deviation of the number of small
aircraft that arrive during a 90-min period?

(c) What is the probability that at least 20 small aircraft arrive during a 2.5-h
period? That at most 10 arrive during this period?

86. The number of people arriving for treatment at an emergency room can be
modeled by a Poisson process with a rate parameter of five per hour.

(a) What is the probability that exactly four arrivals occur during a particular
hour?

(b) What is the probability that at least four people arrive during a particular
hour?

(c) How many people do you expect to arrive during a 45-min period?

87. Suppose that trees are distributed in a forest according to a two-dimensional
Poisson process with rate λ, the expected number of trees per acre, equal to 80.

(a) What is the probability that in a certain quarter-acre plot, there will be at
most 16 trees?

(b) If the forest covers 85,000 acres, what is the expected number of trees in the
forest?

(c) Suppose you select a point in the forest and construct a circle of radius .1
mile. Let X = the number of trees within that circular region. What is the
pmf of X? [Hint: 1 sq mile = 640 acres.]

88. Let X have a Poisson distribution with parameter μ. Show that E(X) = μ directly
from the definition of expected value. [Hint: The first term in the sum equals
0, and then x can be canceled. Now factor out μ and show that what is left sums
to 1.]

89. In some applications the distribution of a discrete rv X resembles the Poisson
distribution except that zero is not a possible value of X. For example, let X =
the number of tattoos that an individual wants removed when s/he arrives at a
tattoo removal facility. Suppose the pmf of X is

$p(x; \theta) = k \dfrac{e^{-\theta}\theta^{x}}{x!} \qquad x = 1, 2, 3, \ldots$

(a) Determine the value of k. [Hint: The sum of all probabilities in the Poisson
pmf is 1, and this pmf must also sum to 1.]

(b) If the mean value of X is 2.313035, what is the probability that an individual
wants at most 5 tattoos removed?

(c) Determine the standard deviation of X when the mean value is as given in
(b).

[Note: The article “An Exploratory Investigation of Identity Negotiation and
Tattoo Removal” (Academy of Marketing Science Review, vol. 12, #6, 2008)
gave a sample of 22 observations on the number of tattoos people wanted
removed; estimates of μ and σ calculated from the data were 2.318182 and
1.249242, respectively.]


2.6 Other Discrete Distributions 


The hypergeometric and negative binomial distributions are both closely related to 
the binomial distribution. Whereas the binomial distribution is the approximate 
probability model for sampling without replacement from a finite dichotomous 
(S-F) population, the hypergeometric distribution is the exact probability model for
the number of S’s in the sample. The binomial rv X is the number of S’s when the 
number n of trials is fixed, whereas the negative binomial distribution arises from 
fixing the number of S’s desired and letting the number of trials be random. 


2.6.1 The Hypergeometric Distribution 


The assumptions leading to the hypergeometric distribution are as follows: 
1. The population or set to be sampled consists of N individuals, objects, or 
elements (a finite population). 
2. Each individual can be characterized as a success (S) or a failure (F), and there
are M successes in the population.

3. A sample of n individuals is selected without replacement in such a way that 
each subset of size n is equally likely to be chosen. 


The random variable of interest is X = the number of S’s in the sample. The 
probability distribution of X depends on the parameters n, M, and N, so we wish to 
obtain P(X = x) = h(x; n, M,N). 


Example 2.39 During a particular period a university’s information technology 
office received 20 service orders for problems with laptops, of which 8 were Macs 
and 12 were PCs. A sample of five of these service orders is to be selected for 
inclusion in a customer satisfaction survey. Suppose that the five are selected in a 
completely random fashion, so that any particular subset of size 5 has the same 
chance of being selected as does any other subset (think of putting the numbers 1, 2, 
... , 20 on 20 identical slips of paper, mixing up the slips, and choosing five of 
them). What then is the probability that exactly 2 of the selected service orders were 
for PC laptops? 

In this example, the population size is N = 20, the sample size is n = 5, and the 
number of S’s (PC = S) and F’s (Mac = F) in the population are M = 12 and N − M = 8,
respectively. Let X = the number of PCs among the five sampled service orders. 
Because all outcomes (each consisting of five particular orders) are equally likely, 


$P(X = 2) = h(2; 5, 12, 20) = \dfrac{\text{number of outcomes having } X = 2}{\text{number of possible outcomes}}$


The number of possible outcomes in the experiment is the number of ways of
selecting 5 from the 20 objects without regard to order, that is, $\binom{20}{5}$. To count the
number of outcomes having X = 2, note that there are $\binom{12}{2}$ ways of selecting two
of the PC orders, and for each such way there are $\binom{8}{3}$ ways of selecting the three
Mac orders to fill out the sample. The Fundamental Counting Principle from
Sect. 1.3 then gives $\binom{12}{2} \cdot \binom{8}{3}$ as the number of outcomes with X = 2, so


$h(2; 5, 12, 20) = \dfrac{\binom{12}{2}\binom{8}{3}}{\binom{20}{5}} = \dfrac{77}{323} = .238$ ■


In general, if the sample size n is smaller than the number of successes in the
population (M), then the largest possible X value is n. However, if M < n (e.g., a
sample size of 25 and only 15 successes in the population), then X can be at most M.
Similarly, whenever the number of population failures (N − M) exceeds the sample
size, the smallest possible X value is 0 (since all sampled individuals might then be
failures). However, if N − M < n, the smallest possible X value is n − (N − M).
Summarizing, the possible values of X satisfy the restriction max(0, n − N + M) ≤
x ≤ min(n, M). An argument parallel to that of the previous example gives the pmf
of X.


PROPOSITION 

If X is the number of S’s in a random sample of size n drawn from a
population consisting of M S’s and (N − M) F’s, then the probability
distribution of X, called the hypergeometric distribution, is given by


$P(X = x) = h(x; n, M, N) = \dfrac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}}$ (2.16)


for x an integer satisfying max(0, n − N + M) ≤ x ≤ min(n, M).³


In Example 2.39, n = 5, M = 12, and N = 20, so h(x; 5, 12, 20) for x = 0, 1, 2, 3,
4, 5 can be obtained by substituting these numbers into Eq. (2.16). 


Example 2.40 Capture-recapture. Five individuals from an animal population 
thought to be near extinction in a region have been caught, tagged, and released to 
mix into the population. After they have had an opportunity to mix, a random 
sample of 10 of these animals is selected. Let X = the number of tagged animals in 
the second sample. If there are actually 25 animals of this type in the region, what is 
the probability that (a) X = 2? (b) X ≤ 2?

Application of the hypergeometric distribution here requires assuming that every 
subset of ten animals has the same chance of being captured. This in turn implies 
that released animals are no easier or harder to catch than are those not initially 
captured. Then the parameter values are n = 10, M = 5 (five tagged animals in the 
population), and N = 25, so 


$h(x; 10, 5, 25) = \dfrac{\binom{5}{x}\binom{20}{10-x}}{\binom{25}{10}} \qquad x = 0, 1, 2, 3, 4, 5$


³ If we define $\binom{a}{b} = 0$ for a < b, then h(x; n, M, N) may be applied for all integers 0 ≤ x ≤ n.
For part (a),

$P(X = 2) = h(2; 10, 5, 25) = \dfrac{\binom{5}{2}\binom{20}{8}}{\binom{25}{10}} = .385$


For part (b),

$P(X \le 2) = P(X = 0, 1, \text{ or } 2) = \sum_{x=0}^{2} h(x; 10, 5, 25) = .057 + .257 + .385 = .699$ ■


Matlab, R, and other software packages will easily generate hypergeometric 
probabilities; see Table 2.5 at the end of this section. Comprehensive tables of the 
hypergeometric distribution are available, but because the distribution has three 
parameters, these tables require much more space than tables for the binomial 
distribution. 
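
For readers without Matlab or R at hand, Eq. (2.16) is short enough to code directly. The sketch below is in Python, and the function names hypergeom_pmf and hypergeom_cdf are our own labels rather than library calls. It evaluates the pmf with binomial coefficients and reproduces the values found in Examples 2.39 and 2.40.

```python
import math

def hypergeom_pmf(x, n, M, N):
    """h(x; n, M, N) from Eq. (2.16); returns 0 outside the support."""
    if x < max(0, n - N + M) or x > min(n, M):
        return 0.0
    return math.comb(M, x) * math.comb(N - M, n - x) / math.comb(N, n)

def hypergeom_cdf(x, n, M, N):
    """P(X <= x), obtained by summing the pmf."""
    return sum(hypergeom_pmf(k, n, M, N) for k in range(0, x + 1))

print(round(hypergeom_pmf(2, 5, 12, 20), 3))   # Example 2.39: 0.238
print(round(hypergeom_pmf(2, 10, 5, 25), 3))   # Example 2.40(a): 0.385
print(round(hypergeom_cdf(2, 10, 5, 25), 3))   # Example 2.40(b): 0.699
```

The explicit support check mirrors the restriction max(0, n − N + M) ≤ x ≤ min(n, M) and keeps the binomial coefficients from being evaluated with negative arguments.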

As in the binomial case, there are simple expressions for E(X) and Var(X) for 
hypergeometric rvs. 


PROPOSITION 
The mean and variance of the hypergeometric rv X having pmf h(x; n, M, N) are 


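$E(X) = n \cdot \dfrac{M}{N} \qquad \mathrm{Var}(X) = \left(\dfrac{N-n}{N-1}\right) \cdot n \cdot \dfrac{M}{N} \cdot \left(1 - \dfrac{M}{N}\right)$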


The ratio M/N is the proportion of S’s in the population. Replacing M/N by p in 
E(X) and Var(X) gives 


$E(X) = np \qquad \mathrm{Var}(X) = \left(\dfrac{N-n}{N-1}\right) \cdot np(1-p)$ (2.17)


Expression (2.17) shows that the means of the binomial and hypergeometric rvs
are equal, whereas the variances of the two rvs differ by the factor (N − n)/(N − 1),
often called the finite population correction factor. This factor is less than 1
whenever n > 1, so the hypergeometric variable has smaller variance than does the
binomial rv. The correction factor can be written (1 − n/N)/(1 − 1/N), which is
approximately 1 when n is small relative to N.
Example 2.41 (Example 2.40 continued) In the animal-tagging example, n =
10, M = 5, and N = 25, so p = 5/25 = .2 and


$E(X) = 10(.2) = 2$

$\mathrm{Var}(X) = \dfrac{25 - 10}{25 - 1}(10)(.2)(.8) = (.625)(1.6) = 1$


If the sampling were carried out with replacement, Var(X) = 1.6. 

Suppose the population size N is not actually known, so the value x is observed
and we wish to estimate N. It is reasonable to equate the observed sample proportion
of S’s, x/n, with the population proportion, M/N, giving the estimate


$\hat{N} = \dfrac{M \cdot n}{x}$


For example, if M = 100, n = 40, and x = 16, then $\hat{N}$ = 250. ■
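
The mean and variance formulas, the finite population correction factor, and the capture-recapture estimate can all be checked in a few lines. The following Python sketch uses helper names of our own choosing (hypergeom_mean_var, estimate_population_size); it also cross-checks the closed-form mean and variance against a direct summation over the pmf for the values of Example 2.41.

```python
import math

def hypergeom_pmf(x, n, M, N):
    """h(x; n, M, N); returns 0 outside the support."""
    if x < max(0, n - N + M) or x > min(n, M):
        return 0.0
    return math.comb(M, x) * math.comb(N - M, n - x) / math.comb(N, n)

def hypergeom_mean_var(n, M, N):
    """Closed-form E(X) and Var(X), including the correction factor."""
    p = M / N
    fpc = (N - n) / (N - 1)          # finite population correction factor
    return n * p, fpc * n * p * (1 - p)

def estimate_population_size(M, n, x):
    """Estimate of N obtained by equating x/n with M/N."""
    return M * n / x

n, M, N = 10, 5, 25
mean, var = hypergeom_mean_var(n, M, N)

# Cross-check: compute the same quantities from the definition.
mean_sum = sum(x * hypergeom_pmf(x, n, M, N) for x in range(n + 1))
var_sum = sum((x - mean_sum) ** 2 * hypergeom_pmf(x, n, M, N) for x in range(n + 1))

print(mean, var)            # closed-form values
print(mean_sum, var_sum)    # values from direct summation
print(estimate_population_size(100, 40, 16))   # 250.0
```

Up to floating-point rounding, both routes should give a mean of 2 and a variance of 1, compared with a variance of 1.6 under sampling with replacement.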


Our rule in Sect. 2.4 stated that if sampling is without replacement but n/N is at 
most .05, then the binomial distribution can be used to compute approximate 
probabilities involving the number of S’s in the sample. A more precise statement 
is as follows: Let the population size, N, and number of population S’s, M, get 
large with the ratio M/N approaching p. Then h(x; n, M, N) approaches b(x; n, p); so 
for n/N small, the two are approximately equal provided that p is not too near either 
0 or 1. This is the rationale for our rule. 
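
A quick numerical experiment illustrates this limiting statement. In the Python sketch below, the population sizes 25, 250, and 2500 and the choices n = 10, p = .2, x = 2 are assumptions made for illustration. The ratio M/N is held at p while N grows, and h(x; n, M, N) is compared with b(x; n, p).

```python
import math

def hypergeom_pmf(x, n, M, N):
    """h(x; n, M, N); returns 0 outside the support."""
    if x < max(0, n - N + M) or x > min(n, M):
        return 0.0
    return math.comb(M, x) * math.comb(N - M, n - x) / math.comb(N, n)

def binom_pmf(x, n, p):
    """b(x; n, p)"""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p, x = 10, 0.2, 2
target = binom_pmf(x, n, p)

gaps = []
for N in (25, 250, 2500):
    M = round(p * N)                     # keep M/N equal to p
    h = hypergeom_pmf(x, n, M, N)
    gaps.append(abs(h - target))
    # columns: N, sampling fraction n/N, hypergeometric value, binomial value
    print(N, round(n / N, 3), round(h, 4), round(target, 4))
```

With N = 25 the sampling fraction is .4 and the two probabilities differ noticeably; once n/N falls well below .05 the gap becomes small, consistent with the rule from Sect. 2.4.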


2.6.2 The Negative Binomial and Geometric Distributions


The negative binomial distribution is based on an experiment satisfying the following
conditions:

1. The experiment consists of a sequence of independent trials.

2. Each trial can result in either a success (S) or a failure (F).

3. The probability of success is constant from trial to trial, so P(S on trial i) = p for
i = 1, 2, 3, ....

4. The experiment continues (trials are performed) until a total of r successes has
been observed, where r is a specified positive integer.


The random variable of interest is X = the number of trials required to achieve 
the rth success, and X is called a negative binomial random variable. In contrast 
to the binomial rv, the number of successes is fixed and the number of trials is 
random. Possible values of X are r, r + 1, r + 2, ..., since it takes at least r trials to
achieve r successes. 
Let nb(x; r, p) denote the pmf of X. The event {X = x} is equivalent to {r − 1 S’s
in the first (x − 1) trials and an S on the xth trial}; e.g., if r = 5 and x = 15, then there
must be four S’s in the first 14 trials and trial 15 must be an S. Since trials are
independent,


nb(x; r, p) = P(X = x) = P(r − 1 S’s on the first x − 1 trials) · P(S) (2.18)


The first probability on the far right of Eq. (2.18) is the binomial probability

$\binom{x-1}{r-1} p^{r-1}(1-p)^{(x-1)-(r-1)}$ where P(S) = p


Simplifying and then multiplying by the extra factor of p at the end of Eq. (2.18)
yields the following.


PROPOSITION 

The pmf of the negative binomial rv X with parameters r = desired number of
S’s and p = P(S) is

$nb(x; r, p) = \binom{x-1}{r-1} p^{r}(1-p)^{x-r} \qquad x = r, r+1, r+2, \ldots$


Example 2.42 A pediatrician wishes to recruit four couples, each of whom
is expecting their first child, to participate in a new natural childbirth regimen.
Let p = P(a randomly selected couple agrees to participate). If p = .2, what is the
probability that exactly 15 couples must be asked before 4 are found who agree to
participate? Substituting r = 4, p = .2, and x = 15 into nb(x; r, p) gives


$nb(15; 4, .2) = \binom{14}{3}(.2)^4(.8)^{11} = .050$


The probability that at most 15 couples need to be asked is


$P(X \le 15) = \sum_{x=4}^{15} nb(x; 4, .2) = \sum_{x=4}^{15} \binom{x-1}{3}(.2)^4(.8)^{x-4} = .352$ ■
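
The two calculations in Example 2.42 follow directly from the proposition. Below is a minimal Python sketch; the names negbin_pmf and negbin_cdf are illustrative, and x counts trials as in this book's definition.

```python
import math

def negbin_pmf(x, r, p):
    """nb(x; r, p): P(the r-th success occurs on trial x), x = r, r+1, ..."""
    if x < r:
        return 0.0
    return math.comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

def negbin_cdf(x, r, p):
    """P(X <= x): at most x trials are needed to obtain r successes."""
    return sum(negbin_pmf(k, r, p) for k in range(r, x + 1))

print(round(negbin_pmf(15, 4, 0.2), 3))   # 0.05
print(round(negbin_cdf(15, 4, 0.2), 3))   # 0.352
```

Note that the sum in the cdf starts at x = r, since fewer than r trials cannot produce r successes.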


In the special case r = 1, the pmf is

$nb(x; 1, p) = (1-p)^{x-1} p \qquad x = 1, 2, \ldots$ (2.19)


In Example 2.10, we derived the pmf for the number of trials necessary to
obtain the first S, and the pmf there is identical to Eq. (2.19). The variable X =
number of trials required to achieve one success is referred to as a geometric
random variable, and the pmf in Eq. (2.19) is called the geometric distribution.
The name is appropriate because the probabilities constitute a geometric series:
p, (1 − p)p, (1 − p)²p, .... To see that the sum of the probabilities is 1, recall that the
sum of a geometric series is a + ar + ar² + ... = a/(1 − r) if |r| < 1, so for p > 0,


$p + (1-p)p + (1-p)^2 p + \cdots = \dfrac{p}{1 - (1-p)} = 1$


In Example 2.19, the expected number of trials until the first S was shown to be
1/p. Intuitively, we would then expect to need r · 1/p trials to achieve the rth S, and
this is indeed E(X). There is also a simple formula for Var(X).


PROPOSITION 
If X is a negative binomial rv with parameters r and p, then


$E(X) = \dfrac{r}{p} \qquad \mathrm{Var}(X) = \dfrac{r(1-p)}{p^2}$


Example 2.43 (Example 2.42 continued) With p = .2, the expected number
of couples the doctor must speak to in order to find 4 that will agree to participate
is r/p = 4/.2 = 20. This makes sense, since with p = .2 = 1/5 it will take
five attempts, on average, to achieve one success. The corresponding variance is
4(1 − .2)/.2² = 80, for a standard deviation of about 8.9. ■
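
The closed-form mean and variance can be sanity-checked against the definitions of expected value and variance. The Python sketch below sums over the pmf up to a truncation point. The cutoff of 2000 trials is an arbitrary assumption, chosen so that the neglected tail is negligible for r = 4 and p = .2.

```python
import math

def negbin_pmf(x, r, p):
    """nb(x; r, p) with x = number of trials needed for the r-th success."""
    if x < r:
        return 0.0
    return math.comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

r, p = 4, 0.2
xs = range(r, 2001)   # truncated support; the cutoff is an assumption

mean_sum = sum(x * negbin_pmf(x, r, p) for x in xs)
var_sum = sum((x - mean_sum) ** 2 * negbin_pmf(x, r, p) for x in xs)

print(round(mean_sum, 6), r / p)                  # summation vs. r/p
print(round(var_sum, 6), r * (1 - p) / p**2)      # summation vs. r(1-p)/p^2
print(round(math.sqrt(var_sum), 1))               # standard deviation
```

The summed values should agree with r/p = 20 and r(1 − p)/p² = 80 to within rounding error.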


Since they are based on similar experiments, some caution must be taken to 
distinguish the binomial and negative binomial models, as seen in the next example. 


Example 2.44 In many communication systems, a receiver will send a short signal 
back to the transmitter to indicate whether a message has been received correctly or 
with errors. (These signals are often called an acknowledgement and a non-acknowl- 
edgement, respectively. Bit sum checks and other tools are used by the receiver to 
determine the absence or presence of errors.) Assume we are using such a system in a 
noisy channel, so that each message is sent error-free with probability .86, indepen- 
dent of all other messages. What is the probability that in 10 transmissions, exactly 
8 will succeed? What is the probability the system will require exactly 10 attempts to 
successfully transmit 8 messages? 

While these two questions may sound similar, they require two different models 
for solution. To answer the first question, let X = the number of successful 
transmissions out of 10. Then X ~ Bin(10, .86), and the answer is 


$P(X = 8) = b(8; 10, .86) = \binom{10}{8}(.86)^8(.14)^2 = .2639$
However, the event {exactly 10 attempts required to successfully transmit
8 messages} is more restrictive: not only must we observe 8 S’s and 2 F’s in
10 trials, but the last trial must be a success. Otherwise, it took fewer than 10 tries to
send 8 messages successfully. Define a variable Y = the number of transmissions
(trials) required to successfully transmit 8 messages. Then Y is negative binomial,
with r = 8 and p = .86, and the answer to the second question is


$P(Y = 10) = nb(10; 8, .86) = \binom{10-1}{8-1}(.86)^8(.14)^2 = .2111$

Notice this is smaller than the answer to the first question, which makes sense 
because (as we noted) the second question imposes an additional constraint. In fact, 
you can think of the “—1” terms in the negative binomial pmf as accounting for this 
loss of flexibility in the placement of S’s and F’s. 

Similarly, the expected number of successful transmissions in 10 attempts is
E(X) = np = 10(.86) = 8.6, while the expected number of attempts required to
successfully transmit 8 messages is E(Y) = r/p = 8/.86 = 9.3. In the first case, the
number of trials (n = 10) is fixed, while in the second case the desired number of
successes (r = 8) is fixed. ■
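
The contrast between the two models in Example 2.44 is easy to see side by side. This Python sketch (function names are our own) computes both probabilities and their ratio.

```python
import math

def binom_pmf(x, n, p):
    """b(x; n, p): x successes in a fixed number n of trials."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def negbin_pmf(x, r, p):
    """nb(x; r, p): the r-th success occurs on trial x."""
    if x < r:
        return 0.0
    return math.comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

p = 0.86
p_first = binom_pmf(8, 10, p)     # exactly 8 successes in 10 transmissions
p_second = negbin_pmf(10, 8, p)   # 8th success occurs on transmission 10

print(round(p_first, 4))              # 0.2639
print(round(p_second, 4))             # 0.2111
print(round(p_second / p_first, 4))   # 0.8
```

The ratio equals $\binom{9}{7} / \binom{10}{8}$ = 36/45 = .8: the two pmfs share the factor (.86)^8(.14)^2 and differ only in how many arrangements of S’s and F’s are allowed once the last trial is forced to be a success.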


By expanding the binomial coefficient in front of $p^{r}(1-p)^{x-r}$ and doing some
cancellation, it can be seen that nb(x; r, p) is well-defined even when r is not an 
integer. This generalized negative binomial distribution has been found to fit 
observed data quite well in a wide variety of applications. 


2.6.3 Alternative Definition of the Negative Binomial Distribution 


There is not universal agreement on the definition of a negative binomial random 
variable (or, by extension, a geometric rv). It is not uncommon in the literature, as 
well as in some textbooks, to see the number of failures preceding the rth success 
called “negative binomial”; in our notation, this simply equals X — r. Possible 
values of this “number of failures” variable are 0, 1, 2, .... Similarly, the geometric 
distribution is sometimes defined in terms of the number of failures preceding the 
first success in a sequence of independent and identical trials. If one uses these 
alternative definitions, then the pmf and mean formula must be adjusted accord- 
ingly. (The variance, however, will stay the same.) 

The authors of Matlab and R are among those who have adopted this alternative
definition; as a result, we must be careful with our inputs to the relevant software
functions. The pmf syntax for the distributions in this section is cataloged in
Table 2.5; cdfs may be invoked by changing pdf to cdf in Matlab or the initial
letter d to p in R. Notice the input argument x − r for the negative binomial
functions: both software packages request the number of failures, rather than the
number of trials.
Table 2.5 Matlab and R code for hypergeometric and negative binomial calculations


             Hypergeometric           Negative Binomial
Function:    pmf                      pmf
Notation:    h(x; n, M, N)            nb(x; r, p)
Matlab:      pdf('hyge',x,N,M,n)      pdf('nbin',x-r,r,p)
R:           dhyper(x,M,N-M,n)        dnbinom(x-r,r,p)


For example, suppose X has a hypergeometric distribution with n =
10, M = 5, N = 25 as in Example 2.40. Using Matlab, we may calculate
P(X = 2) = pdf('hyge',2,25,5,10) and P(X ≤ 2) =
cdf('hyge',2,25,5,10). The corresponding R function calls are
dhyper(2,5,20,10) and phyper(2,5,20,10), respectively. If X is the
negative binomial variable of Example 2.42 with parameters r = 4 and p = .2, then
the chance of requiring 15 trials to achieve 4 successes (i.e., 11 total failures) can be
found in Matlab with pdf('nbin',11,4,.2) and in R using the command
dnbinom(11,4,.2).
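
The relationship between the two definitions can be made concrete. The Python sketch below codes both versions with function names of our own choosing. The failures-based formula is the standard pmf of Y = X − r, which we take to be what the software functions compute; this is an assumption rather than something verified against either package here.

```python
import math

def negbin_pmf_trials(x, r, p):
    """This book's definition: x = number of trials until the r-th success."""
    if x < r:
        return 0.0
    return math.comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)

def negbin_pmf_failures(y, r, p):
    """Alternative definition: y = number of failures before the r-th success."""
    if y < 0:
        return 0.0
    return math.comb(y + r - 1, r - 1) * p**r * (1 - p)**y

r, p = 4, 0.2
x = 15                       # trials, as in Example 2.42
y = x - r                    # the same event expressed as 11 failures

print(round(negbin_pmf_trials(x, r, p), 3))
print(round(negbin_pmf_failures(y, r, p), 3))

mean_trials = r / p                  # E(X)
mean_failures = r * (1 - p) / p      # E(X - r) = E(X) - r
print(mean_trials, mean_failures)
```

Both calls should return the same probability, and the two means differ by exactly r, while the variance is unaffected by the shift.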


2.6.4 Exercises: Section 2.6 (90-106) 


90. An electronics store has received a shipment of 20 table radios that have 
connections for an iPod or iPhone. Twelve of these have two slots (so they can 
accommodate both devices), and the other eight have a single slot. Suppose 
that six of the 20 radios are randomly selected to be stored under a shelf where 
radios are displayed, and the remaining ones are placed in a storeroom. Let 
X = the number among the radios stored under the display shelf that have two 
slots. 

(a) What kind of a distribution does X have (name and values of all 
parameters)? 

(b) Compute P(X = 2), P(X ≤ 2), and P(X ≥ 2).

(c) Calculate the mean value and standard deviation of X. 

91. Each of 12 refrigerators has been returned to a distributor because of an 
audible, high-pitched, oscillating noise when the refrigerator is running. 
Suppose that 7 of these refrigerators have a defective compressor and the 
other 5 have less serious problems. If the refrigerators are examined in random 
order, let X be the number among the first 6 examined that have a defective 
compressor. Compute the following: 

(a) P(X = 5) 

(b) P(X ≤ 4)

(c) The probability that X exceeds its mean value by more than 1 standard 
deviation. 

(d) Consider a large shipment of 400 refrigerators, of which 40 have defective
compressors. If X is the number among 15 randomly selected refrigerators
that have defective compressors, describe a less tedious way to calculate
(at least approximately) P(X ≤ 5) than to use the hypergeometric pmf.

92. An instructor who taught two sections of statistics last term, the first with
20 students and the second with 30, decided to assign a term project. After all
projects had been turned in, the instructor randomly ordered them before
grading. Consider the first 15 graded projects.

(a) What is the probability that exactly 10 of these are from the second
section?

(b) What is the probability that at least 10 of these are from the second
section?

(c) What is the probability that at least 10 of these are from the same section?

(d) What are the mean and standard deviation of the number among these
15 that are from the second section?

(e) What are the mean and standard deviation of the number of projects not
among these first 15 that are from the second section?

93. A geologist has collected 10 specimens of basaltic rock and 10 specimens of
granite. The geologist instructs a laboratory assistant to randomly select 15 of
the specimens for analysis.

(a) What is the pmf of the number of granite specimens selected for analysis?

(b) What is the probability that all specimens of one of the two types of rock
are selected for analysis?

(c) What is the probability that the number of granite specimens selected for
analysis is within 1 standard deviation of its mean value?

94. A personnel director interviewing 11 senior engineers for four job openings
has scheduled six interviews for the first day and five for the second day of
interviewing. Assume the candidates are interviewed in random order.

(a) What is the probability that x of the top four candidates are interviewed on
the first day?

(b) How many of the top four candidates can be expected to be interviewed on
the first day?

95. Twenty pairs of individuals playing in a bridge tournament have been seeded
1, ..., 20. In the first part of the tournament, the 20 are randomly divided into
10 east-west pairs and 10 north-south pairs.

(a) What is the probability that x of the top 10 pairs end up playing east-west?

(b) What is the probability that all of the top five pairs end up playing the
same direction?

(c) If there are 2n pairs, what is the pmf of X = the number among the top
n pairs who end up playing east-west? What are E(X) and Var(X)?

96. A second-stage smog alert has been called in an area of Los Angeles County
in which there are 50 industrial firms. An inspector will visit 10 randomly
selected firms to check for violations of regulations.

(a) If 15 of the firms are actually violating at least one regulation, what is the
pmf of the number of firms visited by the inspector that are in violation of
at least one regulation?

(b) If there are 500 firms in the area, of which 150 are in violation, approximate
the pmf of part (a) by a simpler pmf.

(c) For X = the number among the 10 visited that are in violation, compute
E(X) and Var(X) both for the exact pmf and the approximating pmf in
part (b).

A shipment of 20 integrated circuits (ICs) arrives at an electronics 

manufacturing site. The site manager will randomly select 4 ICs and test 

them to see whether they are faulty. Unknown to the site manager, 5 of these 

20 ICs are faulty. 

(a) Suppose the shipment will be accepted if and only if none of the inspected 
ICs is faulty. What is the probability this shipment of 20 ICs will be 
accepted? 

(b) Now suppose the shipment will be accepted if and only if at most one of 
the inspected ICs is faulty. What is the probability this shipment of 20 ICs 
will be accepted? 

(c) How do your answers to (a) and (b) change if the number of faulty ICs in 
the shipment is 3 instead of 5? Recalculate (a) and (b) to verify your claim. 

Suppose that 20% of all individuals have an adverse reaction to a particular 

drug. A medical researcher will administer the drug to one individual after 

another until the first adverse reaction occurs. Define an appropriate random 
variable and use its distribution to answer the following questions. 

(a) What is the probability that when the experiment terminates, four 
individuals have not had adverse reactions? 

(b) What is the probability that the drug is administered to exactly five 
individuals? 

(c) What is the probability that at most four individuals do not have an adverse 
reaction? 

(d) How many individuals would you expect to not have an adverse reaction, 
and how many individuals would you expect to be given the drug? 

(e) What is the probability that the number of individuals given the drug is 
within one standard deviation of what you expect? 

Suppose that p = P(female birth) = .5. A couple wishes to have exactly two 

female children in their family. They will have children until this condition is 

fulfilled. 

(a) What is the probability that the family has x male children? 

(b) What is the probability that the family has four children? 

(c) What is the probability that the family has at most four children? 

(d) How many children would you expect this family to have? How many 
male children would you expect this family to have? 

A family decides to have children until it has three children of the same 

gender. Assuming P(B) = P(G) = .5, what is the pmf of X = the number of 

children in the family? 

Three brothers and their wives decide to have children until each family has 

two female children. Let X = the total number of male children born to the 

brothers. What is E(X), and how does it compare to the expected number of 
male children born to each brother? 
According to the article “Characterizing the Severity and Risk of Drought in 
the Poudre River, Colorado” (J. of Water Res. Planning and Mgmnt., 2005: 
383-393), the drought length Y is the number of consecutive time intervals in 
which the water supply remains below a critical value $y_0$ (a deficit), preceded 
and followed by periods in which the supply exceeds this value (a surplus). 
The cited paper proposes a geometric distribution with p = .409 for this 
random variable. 

(a) What is the probability that a drought lasts exactly 3 intervals? At least 
3 intervals? 

(b) What is the probability that the length of a drought exceeds its mean value 
by at least one standard deviation? 

Individual A has a red die and B has a green die (both fair). If they each roll 
until they obtain five “doubles” (1-1, ..., 6-6), what is the pmf of X = the total 
number of times a die is rolled? What are E(X) and SD(X)? 

A carnival game consists of spinning a wheel with 10 slots, nine red and one 

blue. If you land on the blue slot, you win a prize. Suppose your significant 

other really wants that prize, so you will play until you win. 

(a) What is the probability you’ll win on the first spin? 

(b) What is the probability you’ll require exactly 5 spins? At least 5 spins? At 
most five spins? 

(c) What is the expected number of spins required for you to win the prize, 
and what is the corresponding standard deviation? 

A kinesiology professor, requiring volunteers for her study, approaches 

students one by one at a campus hub. She will continue until she acquires 

40 volunteers. Suppose that 25% of students are willing to volunteer for the 

study, that the professor’s selections are random, and that the student popula- 

tion is large enough that individual “trials” (asking a student to participate) 

may be treated as independent. 

(a) What is the expected number of students the kinesiology professor will 
need to ask in order to get 40 volunteers? What is the standard deviation? 

(b) Determine the probability that the number of students the kinesiology 
professor will need to ask is within one standard deviation of the mean. 

Refer back to the communication system of Example 2.44. Suppose a voice 

packet can be transmitted a maximum of 10 times, i.e., if the 10th attempt 

fails, no 11th attempt is made to retransmit the voice packet. Let X = the 

number of times a message is transmitted. Assuming each transmission 

succeeds with probability p, determine the pmf of X. Then obtain an expres- 

sion for the expected number of times a packet is transmitted. 


2.7 Moments and Moment Generating Functions 


The expected values of integer powers of $X$ and $X - \mu$ are often referred to as 
moments, terminology borrowed from physics. In this section, we’ll discuss the 
general topic of moments and develop a shortcut for computing them. 
DEFINITION 

The kth moment of a random variable X is $E(X^k)$, while the kth 
moment about the mean (or kth central moment) of X is $E[(X - \mu)^k]$, 
where $\mu = E(X)$. 


For example, $\mu = E(X)$ is the “first moment” of X and corresponds to the center 
of mass of the distribution of X. Similarly, $\mathrm{Var}(X) = E[(X - \mu)^2]$ is the second 
moment of X about the mean, which is known in physics as the moment of inertia. 


Example 2.45 A popular brand of dog food is sold in 5, 10, 15, and 20 lb bags. Let 
X be the weight of the next bag purchased, and suppose the pmf of X is 


x    |  5    10    15    20 
p(x) | .1    .2    .3    .4 


The first moment of X is its mean: 

$$\mu = E(X) = \sum_{x \in D} x\,p(x) = 5(.1) + 10(.2) + 15(.3) + 20(.4) = 15 \text{ lb}$$

The second moment about the mean is the variance: 

$$\sigma^2 = E[(X - \mu)^2] = \sum_{x \in D} (x - \mu)^2 p(x) = (5 - 15)^2(.1) + (10 - 15)^2(.2) + (15 - 15)^2(.3) + (20 - 15)^2(.4) = 25,$$

for a standard deviation of 5 lb. The third central moment of X is 

$$E[(X - \mu)^3] = \sum_{x \in D} (x - \mu)^3 p(x) = (5 - 15)^3(.1) + (10 - 15)^3(.2) + (15 - 15)^3(.3) + (20 - 15)^3(.4) = -75$$


We’ll discuss an interpretation of this last number shortly. ■ 


It is not difficult to verify that the third moment about the mean is 0 if the pmf of 
X is symmetric. We would like to use $E[(X - \mu)^3]$ as a measure of lack of symmetry, 
but it depends on the scale of measurement. If we switch the unit of weight in 
Example 2.45 from pounds to ounces or kilograms, the value of the third moment 
about the mean (as well as the values of all the other moments) will change. But we 
can achieve scale independence by dividing the third moment about the mean by $\sigma^3$: 

$$\frac{E[(X - \mu)^3]}{\sigma^3} = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right] \qquad (2.20)$$

Expression (2.20) is our measure of departure from symmetry, called the skewness 
coefficient. The skewness coefficient for a symmetric distribution is 0 because 
its third moment about the mean is 0. However, in the foregoing example the 
skewness coefficient is $E[(X - \mu)^3]/\sigma^3 = -75/5^3 = -0.6$. When the skewness 
coefficient is negative, as it is here, we say that the distribution is negatively skewed 
or that it is skewed to the left. Generally speaking, it means that the distribution 
stretches farther to the left of the mean than to the right. 

If the skewness were positive, then we would say that the distribution is 
positively skewed or that it is skewed to the right. For example, reverse the order 
of the probabilities in the p(x) table above, so the probabilities of the values 5, 10, 
15, 20 are now .4, .3, .2, and .1 (customers now favor much smaller bags of dog 
food). Exercise 119 shows that this changes the sign but not the magnitude of the 
skewness coefficient, so it becomes +0.6 and the distribution is skewed right. Both 
distributions are illustrated in Fig. 2.9. 


[Figure: panels (a) and (b), each a bar chart of p(x) against x = 5, 10, 15, 20, with vertical axis running from 0.1 to 0.4] 

Fig. 2.9 Departures from symmetry: (a) skewness coefficient < 0 (skewed left); (b) skewness 
coefficient > 0 (skewed right) 
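
The arithmetic behind the two skewness coefficients can be checked numerically. The following R sketch computes Expression (2.20) directly from a pmf; the helper name `skewness` is introduced here for illustration and is not part of base R.

```r
# Skewness coefficient of a discrete pmf, following Expression (2.20).
# skewness() is a helper defined here for illustration only.
skewness <- function(x, p) {
  mu    <- sum(x * p)                  # first moment, E(X)
  sigma <- sqrt(sum((x - mu)^2 * p))   # square root of the second central moment
  third <- sum((x - mu)^3 * p)         # third central moment
  third / sigma^3                      # scale-free measure of asymmetry
}

x       <- c(5, 10, 15, 20)
p_left  <- c(.1, .2, .3, .4)   # pmf of Example 2.45
p_right <- rev(p_left)         # same probabilities in reversed order

skew_left  <- skewness(x, p_left)
skew_right <- skewness(x, p_right)
print(c(left = skew_left, right = skew_right))
```

Because the coefficient divides by the cube of the standard deviation, rescaling `x` (say, multiplying by 16 to convert pounds to ounces) leaves both results unchanged.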


2.7.1 The Moment Generating Function 


Calculation of the mean, variance, skewness coefficient, etc. for a particular 
discrete rv requires extensive, sometimes tedious, summation. Mathematicians 
have developed a tool, the moment generating function, that will allow us to 
determine the moments of a distribution with less effort. Moreover, this function 
will allow us to derive properties of several of our major probability distributions 
here and in subsequent sections of the book. 
DEFINITION 
The moment generating function (mgf) of a discrete random variable X is 
defined to be 

$$M_X(t) = E(e^{tX}) = \sum_{x \in D} e^{tx} p(x)$$

where D is the set of possible X values. The moment generating function 
exists iff $M_X(t)$ is defined for an interval that includes zero as well as positive 
and negative values of t. 


For any random variable X, the mgf evaluated at t = 0 is 

$$M_X(0) = E(e^{0 \cdot X}) = \sum_{x \in D} e^{0 \cdot x} p(x) = \sum_{x \in D} 1 \cdot p(x) = 1$$

That is, $M_X(0)$ is the sum of all the probabilities, so it must always be 1. However, 
in order for the mgf to be useful in generating moments, it will need to be defined 
for an interval of values of t including 0 in its interior. The moment generating 
function fails to exist in cases when moments themselves fail to exist (see Example 
2.49 below). 


Example 2.46 The simplest example of an mgf is for a Bernoulli distribution, 
where only the X values 0 and 1 receive positive probability. Let X be a Bernoulli 
random variable with p(0) = 1/3 and p(1) = 2/3. Then 

$$M_X(t) = E(e^{tX}) = \sum_{x \in D} e^{tx} p(x) = e^{t \cdot 0}(1/3) + e^{t \cdot 1}(2/3) = (1/3) + (2/3)e^t$$

A Bernoulli random variable will always have an mgf of the form $p(0) + p(1)e^t$, a 
well-defined function for all values of t. ■ 


A key property of the mgf is its “uniqueness,” the fact that it completely 
characterizes the underlying distribution. 


MGF UNIQUENESS THEOREM 

If the mgf exists and is the same for two distributions, then the two 
distributions are identical. That is, the moment generating function uniquely 
specifies the probability distribution; there is a one-to-one correspondence 
between distributions and mgfs. 
The proof of this theorem, originally due to Laplace, requires some sophisticated 
mathematics and is beyond the scope of this textbook. 


Example 2.47 Let X, the number of claims submitted on a renter’s insurance 
policy in a given year, have mgf $M_X(t) = .7 + .2e^t + .1e^{2t}$. It follows that 
X must have the pmf p(0) = .7, p(1) = .2, and p(2) = .1, because if we use this 
pmf to obtain the mgf, we get $M_X(t)$, and the distribution is uniquely determined by 
its mgf. ■ 


Example 2.48 Consider testing individuals’ blood samples one by one in order to 
find someone whose blood type is Rh+. Suppose X, the number of tested samples, 
has a geometric distribution with p = .85: 

$$p(x) = .85(.15)^{x-1} \quad \text{for } x = 1, 2, 3, \ldots$$

Determining the moment generating function here requires using the formula for 
the sum of a geometric series: $1 + r + r^2 + \cdots = 1/(1 - r)$ for $|r| < 1$. The moment 
generating function is 

$$M_X(t) = E(e^{tX}) = \sum_{x \in D} e^{tx} p(x) = \sum_{x=1}^{\infty} e^{tx}\,.85(.15)^{x-1} = .85e^t \sum_{x=1}^{\infty} e^{t(x-1)}(.15)^{x-1}$$

$$= .85e^t \sum_{x=1}^{\infty} (.15e^t)^{x-1} = .85e^t\left[1 + .15e^t + (.15e^t)^2 + \cdots\right] = \frac{.85e^t}{1 - .15e^t}$$

The condition on r requires $|.15e^t| < 1$. Dividing by .15 and taking logs, this 
gives $t < -\ln(.15) \approx 1.90$, i.e., this function is defined in the interval $(-\infty, 1.90)$. 
The result is an interval of values that includes 0 in its interior, so the mgf exists. As 
a check, $M_X(0) = .85/(1 - .15) = 1$, as required. ■ 
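
As a numerical sanity check on the closed form in Example 2.48, one can compare it with a long partial sum of the defining series. The R sketch below does this for a few values of t inside the interval of convergence; the function names, the truncation point of 200 terms, and the grid of t values are arbitrary choices made here.

```r
# Geometric pmf with p = .85, as in Example 2.48
p <- .85
mgf_closed <- function(t) p * exp(t) / (1 - (1 - p) * exp(t))

# Partial sum of E(e^{tX}) over x = 1, ..., terms (truncation point is arbitrary)
mgf_series <- function(t, terms = 200) {
  x <- 1:terms
  sum(exp(t * x) * p * (1 - p)^(x - 1))
}

t_values <- c(-1, 0, 0.5, 1.5)   # all below -log(.15), which is about 1.90
closed <- sapply(t_values, mgf_closed)
series <- sapply(t_values, mgf_series)
print(cbind(t = t_values, closed, series))
```

For t at or above -ln(.15) the partial sums grow without bound as more terms are added, which is the numerical counterpart of the mgf being undefined there.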


Example 2.49 Reconsider Example 2.20, where $p(x) = k/x^2$, x = 1, 2, 3, .... Recall 
that E(X) does not exist for this distribution, portending a problem for the existence 
of the mgf: 

$$M_X(t) = E(e^{tX}) = \sum_{x=1}^{\infty} e^{tx} \frac{k}{x^2}$$

With the help of tests for convergence such as the ratio test, we find that the 
series converges if and only if $e^t \le 1$, which means that $t \le 0$, i.e., the mgf is only 
defined on the interval $(-\infty, 0]$. Because zero is on the boundary of this interval, 
not the interior of the interval (the interval must include both positive and negative 
values), the mgf of this distribution does not exist. In any case, it could not be useful 
for finding moments, because X does not have even a first moment (mean). ■ 
2.7.2 Obtaining Moments from the MGF 


We now turn to the computation of moments from the mgf. For any positive integer 
r, let $M_X^{(r)}(t)$ denote the rth derivative of $M_X(t)$. By computing this and then setting 
t = 0, we get the rth moment about 0. 


THEOREM 
If the mgf of X exists, then $E(X^r)$ is finite for all positive integers r, and 

$$E(X^r) = M_X^{(r)}(0) \qquad (2.21)$$


Proof The proof of the existence of all moments is beyond the scope of this book. 
We will show that Eq. (2.21) is true for r = 1 and r = 2. A proof by mathematical 
induction can be used for general r. Differentiate: 

$$\frac{d}{dt} M_X(t) = \frac{d}{dt} \sum_{x \in D} e^{xt} p(x) = \sum_{x \in D} \frac{d}{dt} e^{xt} p(x) = \sum_{x \in D} x e^{xt} p(x)$$

where we have interchanged the order of summation and differentiation. (This is 
justified inside the interval of convergence, which includes 0 in its interior.) Next 
set t = 0 to obtain the first moment: 

$$M_X'(0) = M_X^{(1)}(0) = \sum_{x \in D} x\,p(x) = E(X)$$

Differentiating a second time gives 

$$\frac{d^2}{dt^2} M_X(t) = \frac{d}{dt} \sum_{x \in D} x e^{xt} p(x) = \sum_{x \in D} x \frac{d}{dt} e^{xt} p(x) = \sum_{x \in D} x^2 e^{xt} p(x)$$

Set t = 0 to get the second moment: 

$$M_X''(0) = M_X^{(2)}(0) = \sum_{x \in D} x^2 p(x) = E(X^2)$$ ■ 


For the pmfs in Examples 2.45 and 2.46, this may seem like needless work— 
after all, for simple distributions with just a few values, we can quickly determine 
the mean, variance, etc. The real utility of the mgf arises for more complicated 
distributions. 
Example 2.50 (Example 2.48 continued) Recall that p = .85 is the probability of a 
person having Rh+ blood and we keep checking people until we find one with this 
blood type. If X is the number of people we need to check, then $p(x) = .85(.15)^{x-1}$, 
x = 1, 2, 3, ..., and the mgf is 

$$M_X(t) = \frac{.85e^t}{1 - .15e^t}$$

Differentiating with the help of the quotient rule, 

$$M_X'(t) = \frac{.85e^t}{(1 - .15e^t)^2}$$

Setting t = 0 then gives $\mu = E(X) = M_X'(0) = 1/.85 = 1.176$. This corresponds to 
the formula 1/p for a geometric distribution. 
To get the second moment, differentiate again: 

$$M_X''(t) = \frac{.85e^t(1 + .15e^t)}{(1 - .15e^t)^3}$$

Setting t = 0, $E(X^2) = M_X''(0) = 1.15/.85^2$. Now use the variance 
shortcut formula: 

$$\mathrm{Var}(X) = \sigma^2 = E(X^2) - \mu^2 = \frac{1.15}{.85^2} - \left(\frac{1}{.85}\right)^2 = \frac{.15}{.85^2} = .2076$$

This matches the variance formula $(1 - p)/p^2$ given without proof toward the end 
of Sect. 2.6. ■ 
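
The differentiation in Example 2.50 can also be delegated to software. The R sketch below uses base R's symbolic derivative function `D()` on the geometric mgf and evaluates the derivatives at t = 0; the variable names are chosen here for illustration.

```r
# mgf of the geometric distribution with p = .85 (Examples 2.48 and 2.50)
M  <- expression(.85 * exp(t) / (1 - .15 * exp(t)))
M1 <- D(M[[1]], "t")   # first derivative, computed symbolically
M2 <- D(M1, "t")       # second derivative

EX   <- eval(M1, list(t = 0))   # E(X)   = M'(0)
EX2  <- eval(M2, list(t = 0))   # E(X^2) = M''(0)
VarX <- EX2 - EX^2              # variance shortcut formula
print(c(mean = EX, variance = VarX))
```

The symbolic expressions returned by `D()` are not simplified, so they look messier than the hand-derived formulas, but they evaluate to the same numbers.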


As mentioned in Sect. 2.3, it is common to transform a rv X using a linear 
function Y = aX + b. What happens to the mgf when we do this? 


PROPOSITION 
Let X have the mgf $M_X(t)$ and let Y = aX + b. Then $M_Y(t) = e^{bt} M_X(at)$. 


Example 2.51 Let X be a Bernoulli random variable with p(0) = 20/38 and p(1) = 
18/38. Think of X as the number of wins, 0 or 1, in a single play of roulette. If you 
play roulette at an American casino and bet on red, then your chances of winning 
are 18/38 because 18 of the 38 possible outcomes are red. From Example 2.46, 
$M_X(t) = 20/38 + e^t(18/38)$. Suppose you bet $5 on red, and let Y be your winnings. If 
X = 0 then Y = -5, and if X = 1 then Y = 5. The linear equation Y = 10X - 5 
gives the appropriate relationship. 
This equation is of the form Y = aX + b with a = 10 and b = -5, so by the 
foregoing proposition 

$$M_Y(t) = e^{bt} M_X(at) = e^{-5t}\left[\frac{20}{38} + e^{10t}\,\frac{18}{38}\right] = e^{-5t}\,\frac{20}{38} + e^{5t}\,\frac{18}{38}$$

This implies that the pmf of Y is p(-5) = 20/38 and p(5) = 18/38; moreover, we 
can compute the mean (and other moments) of Y directly from this mgf. ■ 
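
To illustrate that last remark, here is a short worked computation (not part of the original example) of the mean and variance of the winnings Y, obtained by differentiating $M_Y(t) = (20/38)e^{-5t} + (18/38)e^{5t}$ and applying Eq. (2.21):

$$M_Y'(t) = -5 \cdot \frac{20}{38} e^{-5t} + 5 \cdot \frac{18}{38} e^{5t} \;\Rightarrow\; E(Y) = M_Y'(0) = \frac{-100 + 90}{38} = -\frac{10}{38} \approx -.263$$

$$M_Y''(t) = 25 \cdot \frac{20}{38} e^{-5t} + 25 \cdot \frac{18}{38} e^{5t} \;\Rightarrow\; E(Y^2) = M_Y''(0) = 25$$

$$\mathrm{Var}(Y) = E(Y^2) - [E(Y)]^2 = 25 - \left(\frac{10}{38}\right)^2 \approx 24.93$$

So a $5 bet on red loses about 26 cents on average, with a standard deviation of very nearly $5.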


2.7.3 MGFs of Common Distributions 


Several of the distributions presented in this chapter (binomial, Poisson, negative 
binomial) have fairly simple expressions for their moment generating functions. 
These mgfs, in turn, allow us to determine the means and variances of the 
distributions without some rather unpleasant summation. (Additionally, we will 
use these mgfs to prove some more advanced distributional properties in Chap. 4.) 

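To start, determining the moment generating function of a binomial rv requires 
use of the binomial theorem: $(a + b)^n = \sum_{x=0}^{n} \binom{n}{x} a^x b^{n-x}$. Then 

$$M_X(t) = E(e^{tX}) = \sum_{x \in D} e^{tx}\, b(x; n, p) = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1 - p)^{n-x}$$

$$= \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x (1 - p)^{n-x} = (pe^t + 1 - p)^n$$

The mean and variance can be obtained by differentiating $M_X(t)$: 

$$M_X'(t) = n(pe^t + 1 - p)^{n-1} pe^t \;\Rightarrow\; \mu = M_X'(0) = np;$$

$$M_X''(t) = n(n - 1)(pe^t + 1 - p)^{n-2} pe^t\, pe^t + n(pe^t + 1 - p)^{n-1} pe^t \;\Rightarrow\; E(X^2) = M_X''(0) = n(n - 1)p^2 + np;$$

$$\sigma^2 = \mathrm{Var}(X) = E(X^2) - \mu^2 = n(n - 1)p^2 + np - n^2p^2 = np(1 - p),$$

in accord with the proposition in Sect. 2.4. 
Derivation of the Poisson mgf utilizes the series expansion $\sum_{x=0}^{\infty} u^x/x! = e^u$: 

$$M_X(t) = E(e^{tX}) = \sum_{x=0}^{\infty} e^{tx}\, \frac{e^{-\mu} \mu^x}{x!} = e^{-\mu} \sum_{x=0}^{\infty} \frac{(\mu e^t)^x}{x!} = e^{-\mu} e^{\mu e^t} = e^{\mu(e^t - 1)}$$

Successive differentiation then gives the mean and variance identified in 
Sect. 2.5 (see Exercise 127). 

Finally, derivation of the negative binomial mgf is based on Newton’s generalization 
of the binomial theorem. The result (see Exercise 124) is 

$$M_X(t) = \left(\frac{pe^t}{1 - (1 - p)e^t}\right)^r$$

The geometric mgf is just the special case r = 1 (cf. Example 2.48 above). There 
is unfortunately no simple expression for the mgf of a hypergeometric rv. 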
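
A quick way to gain confidence in a closed-form mgf is to compare it with the defining sum. The R sketch below does this for the binomial case using the built-in pmf function `dbinom`; the values n = 10 and p = .3 and the grid of t values are arbitrary illustrative choices.

```r
# Compare the closed-form binomial mgf with the defining sum E(e^{tX}).
# n, p and the grid of t values are arbitrary illustrative choices.
n <- 10
p <- .3
t_values <- c(-1, -0.5, 0, 0.5, 1)

mgf_closed <- (p * exp(t_values) + 1 - p)^n
mgf_sum <- sapply(t_values, function(t) sum(exp(t * 0:n) * dbinom(0:n, n, p)))
print(cbind(t = t_values, mgf_closed, mgf_sum))
```

Since the binomial rv takes only finitely many values, the sum is exact (up to rounding) for every t, consistent with the binomial mgf being defined for all t.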


2.7.4 Exercises: Section 2.7 (107-128) 


107. For the entry-level employees of a certain fast food chain, the pmf of 
X = highest grade level completed is specified by p(9) = .01, p(10) = .05, 
p(11) = .16, and p(12) = .78. 

(a) Determine the moment generating function of this distribution. 
(b) Use (a) to find E(X) and SD(X). 

108. For a new car the number of defects X has the distribution given by the 
accompanying table. Find $M_X(t)$ and use it to find E(X) and SD(X). 


x       0     1     2     3     4     5     6 
p(x)   .04   .20   .34   .20   .15   .04   .03 


109. In flipping a fair coin let X be the number of tosses to get the first head. Then 
$p(x) = .5^x$ for x = 1, 2, 3, .... Find $M_X(t)$ and use it to get E(X) and SD(X). 

110. If you toss a fair die with outcome X, p(x) = 1/6 for x = 1, 2, 3, 4, 5, 6. Find 
$M_X(t)$. 

111. Find the skewness coefficients of the distributions in the previous four 
exercises. Do these agree with the “shape” of each distribution? 

112. Given $M_X(t) = .2 + .3e^t + .5e^{3t}$, find p(x), E(X), Var(X). 

113. If $M_X(t) = 1/(1 - t^2)$, find E(X) and Var(X). 

114. Show that $g(t) = te^t$ cannot be a moment generating function. 

115. Using a calculation similar to the one in Example 2.48 show that, if X has a 
geometric distribution with parameter p, then its mgf is 

$$M_X(t) = \frac{pe^t}{1 - (1 - p)e^t}$$

Assuming that Y has mgf $M_Y(t) = .75e^t/(1 - .25e^t)$, determine the probability 
mass function p(y) with the help of the uniqueness property. 
116. (a) Prove the result in the second proposition: $M_{aX+b}(t) = e^{bt} M_X(at)$. 
(b) Let Y = aX + b. Use (a) to establish the relationships between the means 
and variances of X and Y. 


117. Let $M_X(t) = e^{5t + 2t^2}$ and let Y = (X - 5)/2. Find $M_Y(t)$ and use it to find E(Y) 
and Var(Y). 

118. Let X have the moment generating function of Example 2.48 and let Y = X - 1. 
Recall that X is the number of people who need to be checked to get someone 
who is Rh+, so Y is the number of people checked before the first Rh+ person is 
found. Find $M_Y(t)$. 

119. Let X be the number of points earned by a randomly selected student on a 
10 point quiz, with possible values 0, 1, 2, ..., 10 and pmf p(x), and suppose 
the distribution has a skewness coefficient of c. Now consider reversing the 
probabilities in the distribution, so that p(0) is interchanged with p(10), p(1) is 
interchanged with p(9), and so on. Show that the skewness coefficient of the 
resulting distribution is -c. [Hint: Let Y = 10 - X and show that Y has the 
reversed distribution. Use this fact to determine $\mu_Y$ and then the value of 
skewness coefficient for the Y distribution.] 

120. Let $M_X(t)$ be the moment generating function of a rv X, and define a new 
function by 

$$L_X(t) = \ln[M_X(t)]$$

Show that (a) $L_X(0) = 0$, (b) $L_X'(0) = \mu$, and (c) $L_X''(0) = \sigma^2$. 

121. Refer back to Exercise 120. If $M_X(t) = e^{5t + 2t^2}$ then find E(X) and Var(X) by 
differentiating 

(a) $M_X(t)$ 

(b) $L_X(t)$ 

122. Refer back to Exercise 120. If $M_X(t) = e^{5(e^t - 1)}$ then find E(X) and Var(X) by 
differentiating 

(a) $M_X(t)$ 

(b) $L_X(t)$ 

123. Obtain the moment generating function of the number of failures, n - X, in a 
binomial experiment, and use it to determine the expected number of failures 
and the variance of the number of failures. Are the expected value and 
variance intuitively consistent with the expressions for E(X) and Var(X)? 
Explain. 

124. Newton’s generalization of the binomial theorem can be used to show that, for 
any positive integer r, 

$$(1 - u)^{-r} = \sum_{k=0}^{\infty} \binom{r + k - 1}{r - 1} u^k$$

Use this to derive the negative binomial mgf presented in this section. Then 
obtain the mean and variance of a negative binomial rv using this mgf. 

125. If X is a negative binomial rv, then Y = X - r is the number of failures 
preceding the rth success. Obtain the mgf of Y and then its mean value and 
variance. 
126. Refer back to Exercise 120. Obtain the negative binomial mean and variance 
from $L_X(t) = \ln[M_X(t)]$. 

127. (a) Use derivatives of $M_X(t)$ to obtain the mean and variance for the Poisson 
distribution. 
(b) Obtain the Poisson mean and variance from $L_X(t) = \ln[M_X(t)]$. In terms of 
effort, how does this method compare with the one in part (a)? 

128. Show that the binomial moment generating function converges to the Poisson 
moment generating function if we let $n \to \infty$ and $p \to 0$ in such a way that np 
approaches a value $\mu > 0$. [Hint: Use the calculus theorem that was used in 
showing that the binomial pmf converges to the Poisson pmf.] There is, in 
fact, a theorem saying that convergence of the mgf implies convergence of the 
probability distribution. In particular, convergence of the binomial mgf to the 
Poisson mgf implies $b(x; n, p) \to p(x; \mu)$. 


2.8 Simulation of Discrete Random Variables 


Probability calculations for complex systems often depend on the behavior of 
various random variables. When such calculations are difficult or impossible, 
simulation is the fallback strategy. In this section, we give a general method for 
simulating an arbitrary discrete random variable and consider implementations in 
existing software for simulating common discrete distributions. 


Example 2.52 Refer back to the distribution of Example 2.11 for the random 
variable X = the amount of memory (GB) in a purchased flash drive, and suppose 
we wish to simulate X. Recall from Sect. 1.6 that we begin with a “standard 
uniform” random number generator, i.e., a software function that generates evenly 
distributed numbers in the interval [0, 1). Our goal is to convert these decimals into 
the values of X with the probabilities specified by its pmf: 5% 1s, 10% 2s, 35% 4s, 
and so on. To that end, we partition the interval [0, 1) according to these 
percentages: [0, .05) has probability .05; [.05, .15) has probability .1, since the 
length of the interval is .1; [.15, .5) has probability .5 — .15 = .35; etc. Proceed as 
follows: given a value u from the RNG, 

- If $0 \le u < .05$, assign the value 1 to the variable x. 

- If $.05 \le u < .15$, assign x = 2. 

- If $.15 \le u < .50$, assign x = 4. 

- If $.50 \le u < .90$, assign x = 8. 

- If $.90 \le u < 1$, assign x = 16. 

Repeating this algorithm n times gives n simulated values of X. Programs in 
Matlab and R that implement this algorithm appear in Fig. 2.10; both return a 
vector, x, containing n = 10,000 simulated values of the specified distribution. 

(a) Matlab 

    x=zeros(10000,1);        % storage for the simulated values
    for i=1:10000
        u=rand;              % standard uniform value
        if u<.05
            x(i)=1;
        elseif u<.15
            x(i)=2;
        elseif u<.50
            x(i)=4;
        elseif u<.90
            x(i)=8;
        else
            x(i)=16;
        end
    end

(b) R 

    x <- NULL                # vector that will hold the simulated values
    for (i in 1:10000) {
      u <- runif(1)          # standard uniform value
      if (u < .05)
        x[i] <- 1
      else if (u < .15)
        x[i] <- 2
      else if (u < .50)
        x[i] <- 4
      else if (u < .90)
        x[i] <- 8
      else
        x[i] <- 16
    }

Fig. 2.10 Simulation code: (a) Matlab; (b) R 


[Figure: histogram of the simulated x values with the exact pmf superimposed; vertical axis “Probability”, horizontal axis from 0 to 20] 

Fig. 2.11 Simulation and exact distribution for Example 2.52 


Figure 2.11 shows a graph of the results of executing the code, in the form of a 
histogram: the height of each rectangle corresponds to the relative frequency of 
each x value in the simulation (i.e., the number of times that value occurred, divided 
by 10,000). The exact pmf of X is superimposed for comparison; as expected, 
simulation results are similar, but not identical, to the theoretical distribution. 
Later in this section, we will present a faster, built-in way to simulate discrete 
distributions in Matlab and R. The method introduced here will, however, prove 
useful in adapting to the case of continuous random variables in Chap. 3. ■ 


In the preceding example, the selected subintervals of [0, 1) were not our only 
choices: any five intervals with lengths .05, .10, .35, .40, and .10 would produce 
the desired result. However, those particular five subintervals have one desirable 
feature: the “cut points” for the intervals (i.e., 0, .05, .15, .50, .90, and 1) are 
precisely the possible heights of the graph of the cdf, F(x). This permits a geometric 
interpretation of the algorithm, which can be seen in Fig. 2.12. The value u provided 
by the RNG corresponds to a position on the vertical axis between 0 and 1; we then 
“invert” the cdf by matching this u-value back to one of the gaps in the graph of 
F(x), denoted by dotted lines in Fig. 2.12. If the gap occurs at horizontal position x, 
then x is our simulated value of the rv X for that run of the simulation. This is often 
referred to as the inverse cdf method for simulating discrete random variables. The 
general method is spelled out in the accompanying box. 

Fig. 2.12 The inverse cdf method for Example 2.52 


Inverse cdf Method for Simulating Discrete Random Variables 

Let X be a discrete random variable taking on values $x_1 < x_2 < \ldots$ with 
corresponding probabilities $p_1, p_2, \ldots$. Define $F_0 = 0$; $F_1 = F(x_1) = p_1$; 
$F_2 = F(x_2) = p_1 + p_2$; and, in general, $F_k = F(x_k) = p_1 + \cdots + p_k = F_{k-1} + p_k$. 
To simulate a value of X, proceed as follows: 

1. Use an RNG to produce a value, u, from [0, 1). 

2. If $F_{k-1} \le u < F_k$, then assign $x = x_k$. 
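
The two steps in the box can be sketched as a single vectorized R function. This is one possible implementation, not the book's own code: the function name `inverse_cdf` is introduced here, and it relies on base R's `cumsum` to build the cut points and `findInterval` to locate each u among them.

```r
# General inverse cdf method for a discrete rv (a sketch).
# inverse_cdf() is a name introduced here; 'values' holds x_1 < x_2 < ...
# and 'probs' holds the matching probabilities p_1, p_2, ...
inverse_cdf <- function(u, values, probs) {
  Fk <- cumsum(probs)              # F_1, F_2, ..., the heights of the cdf
  k  <- findInterval(u, Fk) + 1    # index k with F_{k-1} <= u < F_k
  k  <- pmin(k, length(values))    # guard against rounding in the last F_k
  values[k]
}

set.seed(1)                        # arbitrary seed, for reproducibility only
u <- runif(10000)                  # step 1: standard uniform values
x <- inverse_cdf(u, c(1, 2, 4, 8, 16), c(.05, .10, .35, .40, .10))
print(table(x) / length(x))        # relative frequencies, to compare with the pmf
```

Unlike the chain of if/else branches in Fig. 2.10, this version needs no changes when the pmf has more (or different) values; only the two input vectors change.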


Example 2.53 (Example 2.52 continued): Suppose the prices for the flash drives, 
in increasing order of memory size, are $10, $15, $20, $25, and $30. If the store 
sells 80 flash drives in a week, what’s the probability they will make a gross profit of 
at least $1800? 
Let Y = the amount spent on a flash drive, which has the following pmf: 


y    |  10    15    20    25    30 
p(y) | .05   .10   .35   .40   .10 


The gross profit for 80 purchases is the sum of 80 values from this distribution. 
Let A = {gross profit ≥ $1800}. We can use simulation to estimate P(A), as 
follows: 

0. Set a counter for the number of times A occurs to zero. 

Repeat n times: 

1. Simulate 80 values $y_1, \ldots, y_{80}$ from the above pmf (using for example an inverse 
cdf program similar to those displayed in Fig. 2.10). 

2. Compute the week’s gross profit, $g = y_1 + \cdots + y_{80}$. 
3. If g ≥ 1800, add 1 to the count of occurrences for A. 

Once the n runs are complete, then $\hat{P}(A)$ = (count of the occurrences of A)/n. 

Figure 2.13 shows the resulting values of g for n = 10,000 simulations in R. In 
effect, our program is simulating a random variable $G = Y_1 + \cdots + Y_{80}$ whose pmf is 
not known (in light of all the possible G values, it would not be worthwhile to 
attempt to determine its pmf analytically). The highlighted bars in Fig. 2.13 correspond 
to g values of at least $1800; in our simulation, such values occurred 1940 
times. Thus, $\hat{P}(A) = 1940/10{,}000 = .194$, with an estimated standard error of 

$$\sqrt{.194(1 - .194)/10{,}000} = .004.$$ ■ 

Fig. 2.13 Simulated distribution of weekly gross profit for Example 2.53 
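
The simulation steps of Example 2.53 can be sketched in a few lines of R. This is not the program used to produce Fig. 2.13; it is a minimal version that draws each week's purchases with the weighted `sample` function rather than a hand-written inverse cdf loop, and the seed and variable names are arbitrary choices made here.

```r
set.seed(2053)                       # arbitrary seed, for reproducibility only
n_runs <- 10000
prices <- c(10, 15, 20, 25, 30)
probs  <- c(.05, .10, .35, .40, .10)

# Each run: draw 80 purchases and add them up to get the week's gross profit g.
g <- replicate(n_runs, sum(sample(prices, 80, replace = TRUE, prob = probs)))

p_hat  <- mean(g >= 1800)                      # proportion of runs in which A occurred
se_hat <- sqrt(p_hat * (1 - p_hat) / n_runs)   # estimated standard error
print(c(estimate = p_hat, std_error = se_hat))
```

A different seed will give a slightly different estimate; the standard error indicates how much run-to-run variation of this kind to expect.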
2.8.1 Simulations Implemented in R and Matlab 


Earlier in this section, we presented the inverse cdf method as a general way to 
simulate discrete distributions applicable in any software. In fact, one can simulate 
generic discrete rvs in both Matlab and R by clever use of the built-in 
randsample and sample functions, respectively. We saw these functions in 
the context of probability simulation in Chap. 1. Both are designed to generate a 
random sample from any selected set of values (even including text values, if 
desired); the “clever” part is that both can accommodate a set of weights. The 
following short example illustrates their use. 

To simulate, say, 35 values from the pmf in Example 2.52, one can use the 
following code in Matlab: 


randsample([1,2,4,8,16],35,true,[.05,.10,.35,.40, .10]) 


The function takes four arguments: the list of x-values, the desired number of 
simulated values (the “sample size”), whether to sample with replacement (here, 
true), and the list of probabilities in the same order as the x-values. The 
corresponding call in R is 


sample(c(1,2,4,8,16),35,TRUE,c(.05,.10,.35,.40,.10)) 


Thanks to the ubiquity of the binomial, Poisson, and other distributions in 
probability modeling, many software packages have built-in tools for simulating 
values from these distributions. Table 2.6 summarizes the relevant functions in 
Matlab and R; the input argument sampsize refers to the desired number of 
simulated values of the distribution. 

A word of warning (really, a reminder) about the way software treats the negative 
binomial distribution: both Matlab and R define a negative binomial rv as the number 
of failures preceding the rth success, which differs from our definition. Assuming you 
want to simulate the number of trials required to achieve r successes, execute the 
code in the last line of Table 2.6 and then add r to each value. 


Table 2.6 Functions to simulate major discrete distributions in Matlab and R 


Distribution         Matlab code                           R code 

Binomial             random('bin',n,p,[sampsize,1])        rbinom(sampsize,n,p) 

Poisson              random('pois',mu,[sampsize,1])        rpois(sampsize,mu) 

Hypergeometric       random('hyge',N,M,n,[sampsize,1])     rhyper(sampsize,M,N-M,n) 

Negative binomial    random('nbin',r,p,[sampsize,1])       rnbinom(sampsize,r,p) 
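
The adjustment for the negative binomial distribution described above can be sketched in R as follows. The values r = 3 and p = .25 are illustrative choices made here, not taken from the text, and the comparison value r/p is the negative binomial mean under this book's "number of trials" definition.

```r
set.seed(26)                       # arbitrary seed
r <- 3
p <- .25                           # illustrative values, not from the text
failures <- rnbinom(10000, r, p)   # R convention: failures before the r-th success
trials   <- failures + r           # this book's convention: total number of trials
print(c(sample_mean = mean(trials), theoretical_mean = r / p))
```

Forgetting to add r shifts every simulated value down by r, so the smallest possible value becomes 0 instead of r; checking `min(trials)` is a quick way to confirm which convention a vector follows.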
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Example 2.54 The number of customers shipping express mail packages at a 
certain store during any particular hour of the day is a Poisson rv with mean 5. 
Each such customer has 1, 2, 3, or 4 packages with probabilities .4, .3, .2, and .1, 
respectively. Let’s carry out a simulation to estimate the probability that at most 
10 packages are shipped during any particular hour. 
Define an event A = {at most 10 packages shipped in an hour}. Our simulation to 
estimate P(A) proceeds as follows. 
0. Set a counter for the number of times A occurs to zero. 
Repeat n times: 
1. Simulate the number of customers in an hour, C, which is Poisson with p = 5. 
2. For each of the C customers, simulate the number of packages shipped according 
to the pmf above. 
3. If the total number of packages shipped is at most 10, add | to the counter for A. 
Matlab and R code to implement this simulation appear in Fig. 2.14. 


a b 
A=0; A <- 0 
for i=1:10000 for (i in 1:10000) { 
c=random('pois',5,1); c<-rpois (1,5) 
packages = randsample([1,2,3,4],c, packages <- sample(c(1,2,3,4),c, 
true, [.4423) 22). 1))4 TRUE: C2624 pe 3p 22 pe) 
if sum(packages) <=10 if (sum(packages) <=10) { 
A=A+1; A<-A+1 
end } 
end } 


Fig. 2.14 Simulation code for Example 2.54: (a) Matlab; (b) R 


In Matlab, 10,000 simulations resulted in 10 or fewer packages 5752 times, for
an estimated probability of \(\hat{P}(A) = .5752\), with an estimated standard error of
\(\sqrt{.5752(1 - .5752)/10000} = .0049\). ■
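
The steps of Example 2.54 can also be sketched in Python using only the standard library. This is an illustrative translation, not the Matlab or R code of Fig. 2.14: because the standard library has no Poisson generator, the sketch draws the number of customers with a hand-written inverse cdf sampler, and the function names and seed are hypothetical.

```python
import math
import random

def poisson_inverse_cdf(mu, rng):
    """Draw one Poisson(mu) value by the inverse cdf method."""
    u = rng.random()
    x = 0
    pmf = math.exp(-mu)              # p(0)
    cdf = pmf
    # Step through x = 0, 1, 2, ... until the cdf first reaches u.
    # The pmf > 0 check guards against floating-point underflow.
    while u > cdf and pmf > 0.0:
        x += 1
        pmf *= mu / x                # p(x) = p(x - 1) * mu / x
        cdf += pmf
    return x

def estimate_prob_at_most(limit, n_runs, rng):
    """Estimate P(total packages <= limit) and its standard error."""
    count = 0
    for _ in range(n_runs):
        customers = poisson_inverse_cdf(5, rng)
        packages = rng.choices([1, 2, 3, 4], weights=[0.4, 0.3, 0.2, 0.1],
                               k=customers)
        if sum(packages) <= limit:
            count += 1
    p_hat = count / n_runs
    std_err = math.sqrt(p_hat * (1 - p_hat) / n_runs)
    return p_hat, std_err

rng = random.Random(254)             # arbitrary seed
p_hat, std_err = estimate_prob_at_most(10, 10000, rng)
print(p_hat, std_err)
```

With 10,000 runs the estimate should land near the .5752 reported for the Matlab run, although the exact value depends on the seed.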


2.8.2 Simulation Mean, Standard Deviation, and Precision


In Sect. 1.6 and in the preceding examples, we used simulation to estimate the
probability of an event. But consider the “gross profit” variable G in Example 2.53:
since we have 10,000 simulated values of this variable, we should be able to
estimate its mean \(\mu_G\) and its standard deviation \(\sigma_G\). More generally, suppose we
have simulated n values \(x_1, \ldots, x_n\) of a random variable X. Then the following
quantities based on our observed values serve as suitable estimates.

DEFINITION
For a set of numerical values \(x_1, \ldots, x_n\), the sample mean, denoted by \(\bar{x}\), is

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \]

The sample standard deviation of these numerical values, denoted by s, is

\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

If \(x_1, \ldots, x_n\) represent simulated values of a random variable X, then we
may estimate the expected value and standard deviation of X by \(\hat{\mu} = \bar{x}\) and
\(\hat{\sigma} = s\), respectively.


The justification for the use of the divisor n — 1 in s will be discussed in Chap. 5. 

In Sect. 1.6, we introduced the standard error of an estimated probability, which
quantifies the precision of a simulation result \(\hat{P}(A)\) as an estimate of a “true”
probability P(A). By analogy, it is possible to quantify the amount by which a
sample mean, \(\bar{x}\), will generally differ from the corresponding expected value μ. For
n simulated values of a random variable, with sample standard deviation s, the
(estimated) standard error of the mean is

\[ \frac{s}{\sqrt{n}} \qquad (2.22) \]

Expression (2.22) will be derived in Chap. 4. As with an estimated probability,
the formula indicates that the precision of \(\bar{x}\) increases (i.e., its standard error
decreases) as n increases, but not very quickly. To increase the precision of \(\bar{x}\) as
an estimate of μ by a factor of 10 (one decimal place) requires increasing the
number of simulation runs, n, by a factor of 100. Unfortunately, there is no general
formula for the standard error of s as an estimate of σ.
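
As a rough illustration of the sample mean, the sample standard deviation, and the standard error of the mean, the Python sketch below computes all three for n = 100 and for n = 10,000 simulated values. The data are an assumption made for illustration: draws from the packages-per-customer pmf of Example 2.54, whose exact mean is 2 and exact standard deviation is 1. The helper name summarize is hypothetical.

```python
import math
import random
import statistics

def summarize(values):
    """Return (sample mean, sample sd with divisor n - 1, estimated standard error)."""
    n = len(values)
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)     # stdev uses the divisor n - 1
    return x_bar, s, s / math.sqrt(n)

rng = random.Random(222)             # arbitrary seed
results = {}
for n in (100, 10000):
    draws = rng.choices([1, 2, 3, 4], weights=[0.4, 0.3, 0.2, 0.1], k=n)
    results[n] = summarize(draws)
    print(n, results[n])
```

Going from 100 to 10,000 runs multiplies n by 100, so the standard error should shrink by a factor of roughly 10, which is the "not very quickly" behavior described above.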


Example 2.55 (Ex. 2.53 continued) The 10,000 simulated values of the random
variable G, which we denote by \(g_1, \ldots, g_{10000}\), are displayed in the histogram in
Fig. 2.13. From these simulated values, we can estimate both the expected value
and standard deviation of G:

\[ \hat{\mu}_G = \bar{g} = \frac{1}{10{,}000}\sum_{i=1}^{10{,}000} g_i = 1759.62 \]

\[ \hat{\sigma}_G = s = \sqrt{\frac{1}{10{,}000 - 1}\sum_{i=1}^{10{,}000} (g_i - \bar{g})^2} = 43.50 \]

We estimate that the average weekly gross profit from flash drive sales is
$1759.62, with a standard deviation of $43.50. Neither of these computations
was performed by hand, of course: if the n simulated values of a variable are
stored in a vector x, then mean(x) and sd(x) in R will provide the sample
mean and standard deviation, respectively. In Matlab, the calls are mean(x) and
std(x).

Applying Eq. (2.22), the (estimated) standard error of \(\bar{g}\) is
\(s/\sqrt{n} = 43.50/\sqrt{10{,}000} = 0.435\). If 10,000 runs are used to simulate G, it’s
estimated that the resulting sample mean will differ from E(G) by roughly 0.435.
(In contrast, the sample standard deviation, s, estimates that gross profit for a single
week, i.e., a single observation g, typically differs from E(G) by about $43.50.)

In Chap. 4, we will see how the expected value and variance of random variables
like G, that are sums of a fixed number of other rvs, can be obtained analytically. ■


Example 2.56 The “help desk” at a university’s computer center receives both 
hardware and software queries. Let X and Y be the number of hardware and software 
queries, respectively, in a given day. Each can be modeled by a Poisson distribution 
with mean 20. Because computer center employees need to be allocated efficiently, 
of interest is the difference between the sizes of the two queues: D = |X − Y|. Let’s
use simulation to estimate (1) the probability the queue sizes differ by more than 5; 
(2) the expected difference; (3) the standard deviation of the difference. 

Figure 2.15 shows Matlab and R code to simulate this process. In both 
languages, the code exploits the built-in Poisson simulator, as well as the fact that 
10,000 simulated values may be called simultaneously. 


(a) Matlab
X=random('pois',20,[10000,1]);
Y=random('pois',20,[10000,1]);
D=abs(X-Y);
sum((D>5))
mean(D)
std(D)

(b) R
X <- rpois(10000,20)
Y <- rpois(10000,20)
D <- abs(X-Y)
sum((D>5))
mean(D)
sd(D)


Fig. 2.15 Simulation code for Example 2.56: (a) Matlab; (b) R


The line sum((D>5)) performs two operations: first, (D>5) determines if
each simulated d value exceeds 5, returning a logical vector of bits; second, sum()
tallies the “success” bits (1s or TRUEs) and gives a count of the number of times the
event {D > 5} occurred in the 10,000 simulations. The results from one run in
Matlab were

\[ \hat{P}(D > 5) = \frac{3843}{10{,}000} = .3843, \qquad \hat{\mu}_D = \bar{d} = 5.0380, \qquad \hat{\sigma}_D = s = 3.8436 \]

A histogram of the simulated values of D appears in Fig. 2.16.

Fig. 2.16 Simulation histogram of D in Example 2.56 ■


2.8.3 Exercises: Section 2.8 (129-141) 


129. Consider the pmf given in Exercise 30 for the random variable Y = the
number of moving violations for which a randomly selected insured
individual was cited during the last 3 years. Write a program to simulate this
random variable, then use your simulation to estimate E(Y) and SD(Y). How
do these compare to the exact values of E(Y) and SD(Y)?

130. Consider the pmf given in Exercise 32 for the random variable X = capacity
of a purchased freezer. Write a program to simulate this random variable,
then use your simulation to estimate E(X) and SD(X). How do these compare
to the exact values of E(X) and SD(X)?

131. Suppose person after person is tested for the presence of a certain
characteristic. The probability that any individual tests positive is .75. Let X = the
number of people who must be tested to obtain five consecutive positive test
results. Use simulation to estimate P(X < 25).

132. The matching problem. Suppose that N items labeled 1, 2, ..., N are shuffled
so that they are in random order. Of interest is how many of these will be in
their “correct” positions (e.g., item #5 situated at the 5th position in the
sequence, etc.) after shuffling.
(a) Write a program that simulates a permutation of the numbers 1 to N and
then records the value of the variable X = number of items in the correct
position.
(b) Set N = 5 in your program, and use at least 10,000 simulations to
estimate E(X), the expected number of items in the correct position.
(c) Set N = 52 in your program (as if you were shuffling a deck of cards),
and use at least 10,000 simulations to estimate E(X). What do you
discover? Is this surprising?

133. Exercise 109 of Chap. 1 referred to a multiple-choice exam in which 10 of
the questions have two options, 13 have three options, 13 have four options,
and the other 4 have five options. Let X = the number of questions a student
gets right, assuming s/he is completely guessing.
(a) Write a program to simulate X, and use your program to estimate the
mean and standard deviation of X.
(b) Estimate the probability a student will score at least one standard
deviation above the mean.

134. Example 2.53 of this section considered the gross profit G resulting from
selling flash drives to 80 customers per week. Of course, it isn’t realistic for the
number of customers to remain fixed from week to week. So, instead, imagine
the number of customers buying flash drives in a week follows a Poisson
distribution with mean 80, and that the amount paid by each customer follows
the distribution for Y provided in that example. Write a program to simulate
the random variable G, and use your simulation to estimate
(a) The probability that weekly gross sales are at least $1,800.
(b) The mean of G.
(c) The standard deviation of G.

135. Exercise 21 (Sect. 2.2) investigated Benford’s law, a discrete distribution
with pmf given by \(p(x) = \log_{10}((x + 1)/x)\) for x = 1, 2, ..., 9. Use the inverse
cdf method to write a program that simulates the Benford’s law distribution.
Then use your program to estimate the expected value and variance of this
distribution.

136. Recall that a geometric rv has pmf \(p(x) = p(1 - p)^{x-1}\) for x = 1, 2, 3, .... In
Example 2.12, it was shown that the cdf of this distribution is given by
\(F(x) = 1 - (1 - p)^x\) for positive integers x.
(a) Write a program that implements the inverse cdf method to simulate a
geometric distribution. Your program should have as inputs the numerical
value of p and the desired sample size.
(b) Use your program to simulate 10,000 values from a geometric rv X with
p = .85. From these values, estimate each of the following: P(X < 2), E(X),
SD(X). How do these compare to the corresponding exact values?

137. Tickets for a particular flight are $250 apiece. The plane seats 120 passengers,
but the airline will knowingly overbook (i.e., sell more than 120 tickets),
because not every paid passenger shows up. Let t denote the number of tickets
the airline sells for this flight, and assume the number of passengers that
actually show up for the flight, X, follows a Bin(t, .85) distribution.
Let B = the number of paid passengers who show up at the airport but are
denied a seat on the plane, so B = X − 120 if X > 120 and B = 0 otherwise. If
the airline must compensate these passengers with $500 apiece, then the
profit the airline makes on this flight is 250t − 500B. (Notice t is fixed, but
B is random.)
(a) Write a program to simulate this scenario. Specifically, your program
should take in t as an input and return many values of the profit variable
250t − 500B.
(b) The airline wishes to determine the optimal value of t, i.e., the number of
tickets to sell that will maximize their expected profit. Run your program
for t = 140, 141, ..., 150, and record the average profit from many runs
under each of these settings. What value of t appears to return the largest
value? [Note: If a clear winner does not emerge, you might need to
increase the number of runs for each t value!]

138. Imagine the following simple game: flip a fair coin repeatedly, winning $1 for
every head and losing $1 for every tail. Your net winnings will potentially
oscillate between positive and negative numbers as play continues. How many
times do you think net winnings will change signs in, say, 1000 coin flips?
5000 flips?
(a) Let X = the number of sign changes in 1000 coin flips. Write a program to
simulate X, and use your program to estimate the probability of at least
10 sign changes.
(b) Use your program to estimate E(X) and SD(X). Does your estimate for
E(X) match your intuition for the number of sign changes?
(c) Repeat parts (a)-(b) with 5000 flips.

139. Exercise 39 (Sect. 2.3) describes the game Plinko from The Price is Right.
Each contestant drops between one and five chips down the Plinko board,
depending on how well s/he prices several small items. Suppose the random
variable C = number of chips earned by a contestant has the following
distribution:

c      1     2     3     4     5
p(c)   .03   .15   .35   .34   .13

The winnings from each chip follow the distribution presented in Exercise
39. Write a program to simulate Plinko; you will need to consider both the
number of chips a contestant earns and how much money is won on each of
those chips. Use your simulation to estimate the answers to the following
questions:
(a) What is the probability a contestant wins more than $11,000?
(b) What is a contestant’s expected winnings?
(c) What is the corresponding standard deviation?
(d) In fact, a player gets one Plinko chip for free and can earn the other four
by guessing the prices of small items (waffle irons, alarm clocks, etc.).
Assume the player has a 50-50 chance of getting each price correct, so we
may write C = 1 + R, where R ~ Bin(4, .5). Use this revised model for C to
estimate the answers to (a)-(c).

140. Recall the Coupon Collector’s Problem described in the book’s Introduction
and again in Exercise 114 of Chap. 1. Let X = the number of cereal boxes
purchased in order to obtain all 10 coupons.
(a) Use a simulation program to estimate E(X) and SD(X). Also compute the
estimated standard error of your sample mean.
(b) How does your estimate of E(X) compare to the theoretical answer given
in the Introduction?
(c) Repeat (a) with 20 coupons required instead of 10. Does it appear to take
roughly twice as long to collect 20 coupons as 10? More than twice as
long? Less?

141. A small high school holds its graduation ceremony in the gym. Because of
seating constraints, students are limited to a maximum of four tickets to
graduation for family and friends. Suppose 30% of students want four tickets,
25% want three, 25% want two, 15% want one, and 5% want none.
(a) Write a simulation for 150 graduates requesting tickets, where students’
requests follow the distribution described above. In particular, keep track
of the variable T = the total number of tickets requested by these
150 students.
(b) The gym can seat a maximum of 410 guests. Based on your simulation,
estimate the probability that all students’ requests can be accommodated.


2.9 Supplementary Exercises (142-170)


142. Consider a deck consisting of seven cards, marked 1, 2, ..., 7. Three of these
cards are selected at random. Define an rv W by W = the sum of the resulting
numbers, and compute the pmf of W. Then compute E(W) and Var(W).
[Hint: Consider outcomes as unordered, so that (1, 3, 7) and (3, 1, 7) are not
different outcomes. Then there are 35 outcomes, and they can be listed.]
(This type of rv actually arises in connection with Wilcoxon’s rank-sum test,
in which there is an x sample and a y sample and W is the sum of the ranks of
the x’s in the combined sample.)

143. After shuffling a deck of 52 cards, a dealer deals out 5. Let X = the number of
suits represented in the five-card hand.
(a) Show that the pmf of X is

x      1      2      3      4
p(x)   .002   .146   .588   .264

[Hint: p(1) = 4P(all are spades), p(2) = 6P(only spades and hearts with at
least one of each), and p(4) = 4P(2 spades ∩ one of each other suit).]
(b) Compute E(X) and SD(X).

144. The negative binomial rv X was defined as the number of trials necessary to
obtain the rth S. Let Y = the number of F’s preceding the rth S. In the same
manner in which the pmf of X was derived, derive the pmf of Y.

145. Of all customers purchasing automatic garage-door openers, 75% purchase a
chain-driven model. Let X = the number among the next 15 purchasers who
select the chain-driven model.
(a) What is the pmf of X?
(b) Compute P(X > 10).
(c) Compute P(6 < X < 10).
(d) Compute E(X) and SD(X).
(e) If the store currently has in stock 10 chain-driven models and 8 shaft-driven
models, what is the probability that the requests of these
15 customers can all be met from existing stock?

146. A friend recently planned a camping trip. He has two flashlights, one that
required a single 6-V battery and another that used two size-D batteries. He
had previously packed two 6-V and four size-D batteries in his camper.
Suppose the probability that any particular battery works is p and that batteries
work or fail independently of one another. Our friend wants to take just one
flashlight. For what values of p should he take the 6-V flashlight?

147. Binary data are transmitted over a noisy communication channel. The
probability that a received binary digit is in error due to channel noise is
0.05. Assume that such errors occur independently within the bit stream.
(a) What is the probability that the 3rd error occurs on the 50th transmitted
bit?
(b) On average, how many bits will be transmitted correctly before the first
error?
(c) Consider a 32-bit “word.” What is the probability of exactly 2 errors in
this word?
(d) Consider the next 10,000 bits. What approximating model could we use
for X = the number of errors in these 10,000 bits? Give both the name of
the model and the value(s) of the parameter(s).

148. A manufacturer of flashlight batteries wishes to control the quality of its
product by rejecting any lot in which the proportion of batteries having
unacceptable voltage appears to be too high. To this end, out of each large
lot (10,000 batteries), 25 will be selected and tested. If at least 5 of these
generate an unacceptable voltage, the entire lot will be rejected. What is the
probability that a lot will be rejected if
(a) 5% of the batteries in the lot have unacceptable voltages?
(b) 10% of the batteries in the lot have unacceptable voltages?
(c) 20% of the batteries in the lot have unacceptable voltages?
(d) What would happen to the probabilities in parts (a)-(c) if the critical
rejection number were increased from 5 to 6?

149. Of the people passing through an airport metal detector, .5% activate it; let
X = the number among a randomly selected group of 500 who activate the
detector.
(a) What is the (approximate) pmf of X?
(b) Compute P(X = 5).
(c) Compute P(X > 5).

150. An educational consulting firm is trying to decide whether high school
students who have never before used a handheld calculator can solve a certain
type of problem more easily with a calculator that uses reverse Polish logic or
one that does not use this logic. A sample of 25 students is selected and
allowed to practice on both calculators. Then each student is asked to work one
problem on the reverse Polish calculator and a similar problem on the other.
Let p = P(S), where S indicates that a student worked the problem more
quickly using reverse Polish logic than without, and let X = number of S’s.
(a) If p = .5, what is P(7 < X < 18)?
(b) If p = .8, what is P(7 < X < 18)?
(c) If the claim that p = .5 is to be rejected when either X < 7 or X >
18, what is the probability of rejecting the claim when it is actually
correct?
(d) If the decision to reject the claim p = .5 is made as in part (c), what is the
probability that the claim is not rejected when p = .6? When p = .8?
(e) What decision rule would you choose for rejecting the claim p = .5 if you
wanted the probability in part (c) to be at most .01?

151. Consider a disease whose presence can be identified by carrying out a blood
test. Let p denote the probability that a randomly selected individual has the
disease. Suppose individuals are independently selected for testing. One way
to proceed is to carry out a separate test on each of the n blood samples. A
potentially more economical approach, group testing, was introduced during
World War II to identify syphilitic men among army inductees. First, take a part
of each blood sample, combine these specimens, and carry out a single test. If
no one has the disease, the result will be negative, and only the one test is
required. If at least one individual is diseased, the test on the combined sample
will yield a positive result, in which case the n individual tests are then carried
out. If p = .1 and n = 3, what is the expected number of tests using this
procedure? What is the expected number when n = 5? [The article “Random
Multiple-Access Communication and Group Testing” (IEEE Trans. Commun.,
1984: 769-774) applied these ideas to a communication system in which the
dichotomy was active/idle user rather than diseased/nondiseased.]

152. Let \(p_1\) denote the probability that any particular code symbol is erroneously
transmitted through a communication system. Assume that on different
symbols, errors occur independently of one another. Suppose also that with
probability \(p_2\) an erroneous symbol is corrected upon receipt. Let X denote the
number of correct symbols in a message block consisting of n symbols (after
the correction process has ended). What is the probability distribution of X?

153. The purchaser of a power-generating unit requires c consecutive successful
start-ups before the unit will be accepted. Assume that the outcomes of individual
start-ups are independent of one another. Let p denote the probability that
any particular start-up is successful. The random variable of interest is X = the
number of start-ups that must be made prior to acceptance. Give the pmf
of X for the case c = 2. If p = .9, what is P(X < 8)? [Hint: For x > 5, express
p(x) “recursively” in terms of the pmf evaluated at the smaller values x − 3,
x − 4, ..., 2.] (This problem was suggested by the article “Evaluation of a Start-Up
Demonstration Test,” J. Qual. Tech., 1983: 103-106.)

154. A plan for an executive travelers’ club has been developed by an airline on the
premise that 10% of its current customers would qualify for membership.
(a) Assuming the validity of this premise, among 25 randomly selected
current customers, what is the probability that between 2 and 6 (inclusive)
qualify for membership?
(b) Again assuming the validity of the premise, what are the expected number
of customers who qualify and the standard deviation of the number who
qualify in a random sample of 100 current customers?
(c) Let X denote the number in a random sample of 25 current customers who
qualify for membership. Consider rejecting the company’s premise in
favor of the claim that p > .10 if x > 7. What is the probability that the
company’s premise is rejected when it is actually valid?
(d) Refer to the decision rule introduced in part (c). What is the probability
that the company’s premise is not rejected even though p = .20 (i.e., 20%
qualify)?

155. Forty percent of seeds from maize (modern-day corn) ears carry single
spikelets, and the other 60% carry paired spikelets. A seed with single
spikelets will produce an ear with single spikelets 29% of the time, whereas
a seed with paired spikelets will produce an ear with single spikelets 26% of
the time. Consider randomly selecting ten seeds.
(a) What is the probability that exactly five of these seeds carry a single
spikelet and produce an ear with a single spikelet?
(b) What is the probability that exactly five of the ears produced by these
seeds have single spikelets? What is the probability that at most five ears
have single spikelets?

156. A trial has just resulted in a hung jury because eight members of the jury were
in favor of a guilty verdict and the other four were for acquittal. If the jurors
leave the jury room in random order and each of the first four leaving the room
is accosted by a reporter in quest of an interview, what is the pmf of X = the
number of jurors favoring acquittal among those interviewed? How many of
those favoring acquittal do you expect to be interviewed?

157. A reservation service employs five information operators who receive
requests for information independently of one another, each according to a
Poisson process with rate λ = 2 per minute.
(a) What is the probability that during a given 1-min period, the first operator
receives no requests?
(b) What is the probability that during a given 1-min period, exactly four of
the five operators receive no requests?
(c) Write an expression for the probability that during a given 1-min period,
all of the operators receive exactly the same number of requests.

158. Grasshoppers are distributed at random in a large field according to a Poisson
process with parameter λ = 2 per square yard. How large should the radius r of
a circular sampling region be taken so that the probability of finding at least
one grasshopper in the region equals .99?

159. A newsstand has ordered five copies of a certain issue of a photography
magazine. Let X = the number of individuals who come in to purchase this
magazine. If X has a Poisson distribution with parameter μ = 4, what is the
expected number of copies that are sold?

160. Individuals A and B begin to play a sequence of chess games. Let S = {A wins
a game}, and suppose that outcomes of successive games are independent
with P(S) = p and P(F) = 1 − p (they never draw). They will play until one of
them wins ten games. Let X = the number of games played (with possible
values 10, 11, ..., 19).
(a) For x = 10, 11, ..., 19, obtain an expression for p(x) = P(X = x).
(b) If a draw is possible, with p = P(S), q = P(F), 1 − p − q = P(draw), what
are the possible values of X? What is P(20 ≤ X)? [Hint: P(20 ≤ X) = 1 −
P(X < 20).]

161. A test for the presence of a disease has probability .20 of giving a false-positive
reading (indicating that an individual has the disease when this is not
the case) and probability .10 of giving a false-negative result. Suppose that ten
individuals are tested, five of whom have the disease and five of whom do not.
Let X = the number of positive readings that result.
(a) Does X have a binomial distribution? Explain your reasoning.
(b) What is the probability that exactly three of the ten test results are
positive?

162. The generalized negative binomial pmf is given by

\[ nb(x; r, p) = k(r, x) \cdot p^r (1 - p)^x \qquad x = 0, 1, 2, \ldots \]

where

\[ k(r, x) = \begin{cases} \dfrac{(x + r - 1)(x + r - 2) \cdots (x + r - x)}{x!} & x = 1, 2, \ldots \\ 1 & x = 0 \end{cases} \]

Let X, the number of plants of a certain species found in a particular region,
have this distribution with p = .3 and r = 2.5. What is P(X = 4)? What is the
probability that at least one plant is found?

163. There are two certified public accountants (CPAs) in a particular office who
prepare tax returns for clients. Suppose that for one type of complex tax form,
the number of errors made by the first preparer has a Poisson distribution with
mean \(\mu_1\), the number of errors made by the second preparer has a Poisson
distribution with mean \(\mu_2\), and that each CPA prepares the same number of
forms of this type. Then if one such form is randomly selected, the function

\[ p(x; \mu_1, \mu_2) = .5\,\frac{e^{-\mu_1}\mu_1^x}{x!} + .5\,\frac{e^{-\mu_2}\mu_2^x}{x!} \qquad x = 0, 1, 2, \ldots \]

gives the pmf of X = the number of errors in the selected form.
(a) Verify that \(p(x; \mu_1, \mu_2)\) is a legitimate pmf (≥ 0 and sums to 1).
(b) What is the expected number of errors on the selected form?
(c) What is the standard deviation of the number of errors on the selected
form?
(d) How does the pmf change if the first CPA prepares 60% of all such forms
and the second prepares 40%?

164. The mode of a discrete random variable X with pmf p(x) is that value x* for
which p(x) is largest (the most probable x value).
(a) Let X ~ Bin(n, p). By considering the ratio b(x + 1; n, p)/b(x; n, p), show
that b(x; n, p) increases with x as long as x < np − (1 − p). Conclude that
the mode x* is the integer satisfying (n + 1)p − 1 ≤ x* ≤ (n + 1)p.
(b) Show that if X has a Poisson distribution with parameter μ, the mode is the
largest integer less than μ. If μ is an integer, show that both μ − 1 and μ are
modes.

165. For a particular insurance policy the number of claims by a policy holder in
5 years is Poisson distributed. If the filing of one claim is four times as likely
as the filing of two claims, find the expected number of claims.

166. If X is a hypergeometric rv, show directly from the definition that E(X) = nM/N
(consider only the case n < M). [Hint: Factor nM/N out of the sum for
E(X), and show that the terms inside the sum are a match to the pmf
h(y; n − 1, M − 1, N − 1), where y = x − 1.]

167. Suppose a store sells two different coffee makers of a particular brand, a basic
model selling for $30 and a fancy one selling for $50. Let X be the number of
people among the next 25 purchasing this brand who choose the fancy one.
Then h(X) = revenue = 50X + 30(25 − X) = 20X + 750, a linear function.
If the choices are independent and have the same probability, then how is
X distributed? Find the mean and standard deviation of h(X). Explain why the
choices might not be independent with the same probability.

168. Let X be a discrete rv with possible values 0, 1, 2, ... or some subset of these. The
function

\[ \psi(s) = E(s^X) = \sum_{x=0}^{\infty} s^x \cdot p(x) \]

is called the probability generating function (pgf) of X.
(a) Suppose X is the number of children born to a family, and p(0) = .2,
p(1) = .5, and p(2) = .3. Determine the pgf of X.
(b) Determine the pgf when X has a Poisson distribution with parameter μ.
(c) Show that ψ(1) = 1.
(d) Show that ψ′(0) = p(1). (You’ll need to assume that the derivative can be
brought inside the summation, which is justified.) What results from
taking the second derivative with respect to s and evaluating at s = 0?
The third derivative? Explain how successive differentiation of ψ(s) and
evaluation at s = 0 “generates the probabilities in the distribution.” Use
this to recapture the probabilities of (a) from the pgf. [Note: This shows
that the pgf contains all the information about the distribution: knowing
ψ(s) is equivalent to knowing p(x).]

169. Consider a collection \(A_1, \ldots, A_k\) of mutually exclusive and exhaustive events
(a partition) and a random variable X whose distribution depends on which of
the \(A_i\)’s occurs. (E.g., a commuter might select one of three possible routes
from home to work, with X representing commute time.) Let \(E(X \mid A_i)\) denote
the expected value of X given that event \(A_i\) occurs. Then, analogous to the Law
of Total Probability, it can be shown that the overall mean of X is given by the
weighted average \(E(X) = \sum_{i=1}^{k} E(X \mid A_i)P(A_i)\).
(a) The expected duration of a voice call to a particular office telephone
number is 3 min, whereas the expected duration of a data call to that
same number is 1 min. If 75% of all calls are voice calls, what is the
expected duration of the next call?
(b) A bakery sells three different types of chocolate chip cookies. The number
of chocolate chips on a type i cookie has a Poisson distribution with mean
\(\mu_i = i + 1\) (i = 1, 2, 3). If 20% of all customers select a cookie of the first
type, 50% choose the second type, and 30% opt for the third type, what is
the expected number of chocolate chips in the next customer’s cookie?

170. Consider a sequence of identical and independent trials, each of which will be
a success S or failure F. Let p = P(S) and q = P(F).
(a) Let X = the number of trials necessary to obtain the first S, a geometric
rv. Here is an alternative approach to determining E(X). Apply the
weighted average formula from the previous exercise with k = 2, \(A_1\) =
{S on 1st trial}, and \(A_2 = A_1'\). Show that E(X) = 1/p. [Hint: Denote E(X) by
μ. Given that the first trial is a failure, one trial has been performed and,
starting from the 2nd trial, we are still looking for the first S. This implies
that \(E(X \mid A_1') = 1 + \mu\).]
(b) Now let Y = the number of trials necessary to obtain two consecutive S’s.
It is not possible to determine E(Y) directly from the definition of
expected value, because there is no formula for the pmf of Y; the
complication is the word consecutive. Use the weighted average formula to
determine E(Y). [Hint: Consider the partition with k = 3 and \(A_1\) = {F},
\(A_2\) = {SS}, \(A_3\) = {SF}.]


Continuous Random Variables and Probability Distributions


As emphasized at the beginning of Chap. 2, the two important types of random 
variables are discrete and continuous. In this chapter, we study the second general 
type of random variable that arises in many applied problems. Sections 3.1 and 3.2 
present the basic definitions and properties of continuous random variables, their 
probability distributions, and their various expected values. The normal distribution, 
arguably the most important and useful model in all of probability and statistics, is 
introduced in Sect. 3.3. Sections 3.4 and 3.5 discuss some other continuous 
distributions that are often used in applied work. In Sect. 3.6, we introduce a method 
for assessing whether given sample data is consistent with a specified distribution. 
Section 3.7 presents methods for obtaining the distribution of a rv Y from the 
distribution of X when the two are related by some equation Y= g(X). The last 
section of this chapter is dedicated to the simulation of continuous rvs. 


3.1 Probability Density Functions and Cumulative 
Distribution Functions 


A discrete random variable (rv) is one whose possible values either constitute a 
finite set or else can be listed in an infinite sequence (a list in which there is a first 
element, a second element, etc.). A random variable whose set of possible values is 
an entire interval of numbers is not discrete. 

Recall from the beginning of Chap. 2 that a random variable X is continuous if 
(1) its possible values comprise either a single interval on the number line (for some 
A<B, any number x between A and B is a possible value) or a union of disjoint 
intervals, and (2) P(X =c) =0 for any number c that is a possible value of X. 


Example 3.1 If in the study of the ecology of a lake, we make depth 
measurements at randomly chosen locations, then X = the depth at such a location 
is a continuous rv. Here A is the minimum depth in the region being sampled, and 
B is the maximum depth. ■ 


M.A. Carlton and J.L. Devore, Probability with Applications in Engineering, Science, and 179 
Technology, Springer Texts in Statistics, DOI 10.1007/978-1-4939-0395-5_3, 
© Springer Science+Business Media New York 2014 
Example 3.2 If a chemical compound is randomly selected and its pH X is 
determined, then X is a continuous rv because any pH value between 0 and 14 is 
possible. If more is known about the compound selected for analysis, then the set of 
possible values might be a subinterval of [0, 14], such as 5.5 ≤ x ≤ 6.5, but X would 
still be continuous. ■ 


Example 3.3 Let X represent the amount of time a randomly selected customer 
spends waiting for a haircut. Your first thought might be that X is a continuous 
random variable, since a measurement is required to determine its value. However, 
there are customers lucky enough to have no wait whatsoever before climbing into 
the barber or stylist’s chair. So it must be the case that P(X = 0) > 0. Conditional on 
no chairs being empty, however, the waiting time will be continuous since X could 
then assume any value between some minimum possible time A and a maximum 
possible time B. This random variable is neither purely discrete nor purely continuous 
but instead is a mixture of the two types. ■ 


One might argue that although in principle variables such as height, weight, and 
temperature are continuous, in practice the limitations of our measuring instruments 
restrict us to a discrete (though sometimes very finely subdivided) world. However, 
continuous models often approximate real-world situations very well, and continuous 
mathematics (the calculus) is frequently easier to work with than the mathematics 
of discrete variables and distributions. 


3.1.1 Probability Distributions for Continuous Variables 


Suppose the variable X of interest is the depth of a lake at a randomly chosen point 
on the surface. Let M = the maximum depth (in meters), so that any number in the 
interval [0, M] is a possible value of X. If we “discretize” X by measuring depth to 
the nearest meter, then possible values are nonnegative integers less than or equal to 
M. The resulting discrete distribution of depth can be pictured using a probability 
histogram. If we draw the histogram so that the area of the rectangle above any 
possible integer k is the proportion of the lake whose depth is (to the nearest meter) 
k, then the total area of all rectangles is 1. A possible histogram appears in Fig. 3.1a. 

If depth is measured much more precisely and the same measurement axis as in 
Fig. 3.1a is used, each rectangle in the resulting probability histogram is much 
narrower, although the total area of all rectangles is still 1. A possible histogram is 
pictured in Fig. 3.1b; it has a much smoother appearance than the histogram in 
Fig. 3.1a. If we continue in this way to measure depth more and more finely, the 
resulting sequence of histograms approaches a smooth curve, as pictured in 
Fig. 3.1c. Because for each histogram the total area of all rectangles equals 1, the 
total area under the smooth curve is also 1. The probability that the depth at a 
randomly chosen point is between a and b is just the area under the smooth curve 
between a and b. It is exactly a smooth curve of the type pictured in Fig. 3.1c that 
specifies a continuous probability distribution. 
Fig. 3.1 (a) Probability histogram of depth measured to the nearest meter; (b) probability 
histogram of depth measured to the nearest centimeter; (c) a limit of a sequence of discrete 
histograms 


DEFINITION 

Let X be a continuous rv. Then a probability distribution or probability 
density function (pdf) of X is a function f(x) such that for any two numbers 
a and b with a ≤ b, 

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

That is, the probability that X takes on a value in the interval [a, b] is the 
area above this interval and under the graph of the density function, as 
illustrated in Fig. 3.2. The graph of f(x) is often referred to as the density 
curve. 


Fig. 3.2 P(a ≤ X ≤ b) = the area under the density curve between a and b 


For f(x) to be a legitimate pdf, it must satisfy the following two conditions: 


1. f(x) ≥ 0 for all x 
2. $\int_{-\infty}^{\infty} f(x)\,dx$ = [area under the entire graph of f(x)] = 1 


Example 3.4 The direction of an imperfection with respect to a reference line on a 
circular object such as a tire, brake rotor, or flywheel is often subject to uncertainty. 
Consider the reference line connecting the valve stem on a tire to the center point, 
and let X be the angle measured clockwise to the location of an imperfection. One 
possible pdf for X is 
$$f(x) = \begin{cases} \dfrac{1}{360} & 0 \le x < 360 \\ 0 & \text{otherwise} \end{cases}$$


The pdf is graphed in Fig. 3.3. Clearly f(x) ≥ 0. The area under the density curve 
is just the area of a rectangle: (height)(base) = (1/360)(360) = 1. The probability that 
the angle is between 90° and 180° is 

$$P(90 \le X \le 180) = \int_{90}^{180} \frac{1}{360}\,dx = \frac{x}{360}\Big|_{x=90}^{x=180} = \frac{1}{4} = .25$$

The probability that the angle of occurrence is within 90° of the reference line is 


P(0 < X < 90) + P(270 < X < 360) = .25 +.25 = .50 


Fig. 3.3 The pdf and probability for Example 3.4; the shaded area is P(90 ≤ X ≤ 180) ■ 
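
The two probabilities in Example 3.4 can also be checked numerically. The following is only a sketch: the names `pdf_angle` and `integrate` are hypothetical helpers introduced here, and a simple midpoint rule stands in for exact integration.

```python
def pdf_angle(x):
    """Density from Example 3.4: 1/360 on [0, 360), 0 elsewhere."""
    return 1 / 360 if 0 <= x < 360 else 0.0


def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


total_area = integrate(pdf_angle, 0, 360)      # should be close to 1
p_90_180 = integrate(pdf_angle, 90, 180)       # P(90 <= X <= 180)
p_within_90 = integrate(pdf_angle, 0, 90) + integrate(pdf_angle, 270, 360)

print(round(total_area, 6), round(p_90_180, 6), round(p_within_90, 6))
```

Because the density is constant, each probability is just (interval width)/360, which is what the numerical integration reproduces.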


Because whenever 0 ≤ a ≤ b ≤ 360 in Example 3.4, P(a ≤ X ≤ b) depends only 
on the width b — a of the interval, X is said to have a uniform distribution. 


DEFINITION 
A continuous rv X is said to have a uniform distribution on the interval 
[A, B] if the pdf of X is 


$$f(x; A, B) = \begin{cases} \dfrac{1}{B - A} & A \le x \le B \\ 0 & \text{otherwise} \end{cases}$$


The statement that X has a uniform distribution on [A, B] will be denoted 
X ~ Unif[A, B]. 




The graph of any uniform pdf looks like the graph in Fig. 3.3 except that the 
interval of positive density is [A, B] rather than [0, 360). 

In the discrete case, a probability mass function (pmf) tells us how little “blobs” 
of probability mass of various magnitudes are distributed along the measurement 
axis. In the continuous case, probability density is “smeared” in a continuous 
fashion along the interval of possible values. When density is smeared evenly 
over the interval, a uniform pdf, as in Fig. 3.3, results. 

When X is a discrete random variable, each possible value is assigned positive 
probability. This is not true of a continuous random variable, because the area under 
a density curve that lies above any single value is zero: 

$$P(X = c) = P(c \le X \le c) = \int_c^c f(x)\,dx = 0$$

The fact that P(X=c)=0 when X is continuous has an important practical 
consequence: The probability that X lies in some interval between a and b does 
not depend on whether the lower limit a or the upper limit b is included in the 
probability calculation: 


$$P(a \le X \le b) = P(a < X < b) = P(a < X \le b) = P(a \le X < b) \qquad (3.1)$$


In contrast, if X were discrete and both a and b are possible values of X (e.g., 
X ~Bin(20, .3) and a=5, b=10), then all four of the probabilities in Eq. (3.1) 
would be different. This also means that whether we include the endpoints of the 
range of values for a continuous rv X is somewhat arbitrary; for example, the pdf in 
Example 3.4 could be defined to be positive on (0, 360) or [0, 360] rather than 
[0, 360), and the same applies for a uniform distribution on [A, B] in general. 
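
To see the discrete contrast concretely, the sketch below evaluates the four interval probabilities for X ~ Bin(20, .3) with a = 5 and b = 10, the case mentioned above. The helper name `binom_pmf` is an assumption made for illustration; only the standard-library function `math.comb` is used.

```python
from math import comb


def binom_pmf(k, n=20, p=0.3):
    """pmf of a Bin(n, p) random variable evaluated at k."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


a, b = 5, 10
closed = sum(binom_pmf(k) for k in range(a, b + 1))          # P(a <= X <= b)
left_open = sum(binom_pmf(k) for k in range(a + 1, b + 1))   # P(a <  X <= b)
right_open = sum(binom_pmf(k) for k in range(a, b))          # P(a <= X <  b)
both_open = sum(binom_pmf(k) for k in range(a + 1, b))       # P(a <  X <  b)

for label, value in [("[a, b]", closed), ("(a, b]", left_open),
                     ("[a, b)", right_open), ("(a, b)", both_open)]:
    print(label, round(value, 4))
```

The four values differ because the endpoints 5 and 10 each carry positive probability mass; for a continuous rv those endpoint contributions are zero.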

The zero probability condition has a physical analog. Consider a solid circular 
rod (with cross-sectional area of 1 in² for simplicity). Place the rod alongside a 
measurement axis and suppose that the density of the rod at any point x is given by 
the value f(x) of a density function. Then if the rod is sliced at points a and b and this 
segment is removed, the amount of mass removed is $\int_a^b f(x)\,dx$; however, if the rod is 
sliced just at the point c, no mass is removed. Mass is assigned to interval segments 
of the rod but not to individual points. 

So, if P(X =c)=0 when X is a continuous rv, then what does f(c) represent? 
After all, if X were discrete, its pmf evaluated at x=c, p(c), would indicate the 
probability that X equals c. To help understand what f(c) means, consider a small 
window near x = c, say, [c, c + Δx]. Using a rectangle to approximate the area 
under f(x) between c and c + Δx (the usual “Riemann approximation” idea from 
calculus), one obtains $\int_c^{c+\Delta x} f(x)\,dx \approx \Delta x \cdot f(c)$, from which 

$$f(c) \approx \frac{\int_c^{c+\Delta x} f(x)\,dx}{\Delta x} = \frac{P(c \le X \le c + \Delta x)}{\Delta x}$$


This indicates that f(c) is not a probability, but rather roughly the probability of 
an interval divided by the length of the chosen interval. If we associate mass with 
probability and remember that interval length is the one-dimensional analog of 
volume, then f represents their quotient, mass per volume, more commonly known 
as density (hence, the name pdf). The height of the function f(x) at a particular point 
reflects how “dense” the values of X are near that point—taller sections of f(x) 
contain more probability within a fixed interval length than do shorter sections. 


Example 3.5 “Time headway” in traffic flow is the elapsed time between the time 
that one car finishes passing a fixed point and the instant that the next car begins to 
pass that point. Let X =the time headway for two randomly chosen consecutive 
cars on a freeway during a period of heavy flow. The following pdf of X is 
essentially the one suggested in “The Statistical Properties of Freeway Traffic” 
(Transp. Res., 11: 221-228): 


$$f(x) = \begin{cases} .15e^{-.15(x-.5)} & x \ge .5 \\ 0 & \text{otherwise} \end{cases}$$


The graph of f(x) is given in Fig. 3.4; there is no density associated with headway 
times less than .5, and headway density decreases rapidly (exponentially fast) as 
x increases from .5. The fact that the graph of f(x) is taller near x= .5 and shorter 
near, say, x= 10 indicates that time headway values are more dense near the left 
boundary, i.e., there is a higher proportion of time headways in the interval [.5, 1.5] 
than in [10, 11], even though these two intervals have the same length. 

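Clearly, f(x) ≥ 0; to show that $\int_{-\infty}^{\infty} f(x)\,dx = 1$ we use the calculus result 
$\int_a^{\infty} e^{-kx}\,dx = (1/k)e^{-k \cdot a}$. Then 

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{.5}^{\infty} .15e^{-.15(x-.5)}\,dx = .15e^{.075}\int_{.5}^{\infty} e^{-.15x}\,dx = .15e^{.075} \cdot \frac{1}{.15}e^{-(.15)(.5)} = 1$$
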


The probability that headway time is at most 5 s is 


$$P(X \le 5) = \int_{-\infty}^{5} f(x)\,dx = \int_{.5}^{5} .15e^{-.15(x-.5)}\,dx = .15e^{.075}\int_{.5}^{5} e^{-.15x}\,dx$$

$$= .15e^{.075} \cdot \left(-\frac{1}{.15}e^{-.15x}\right)\Big|_{x=.5}^{x=5} = e^{.075}\left(-e^{-.75} + e^{-.075}\right) = 1.078(-.472 + .928) = .491$$

Fig. 3.4 The density curve for headway time in Example 3.5; the shaded area is P(X ≤ 5) 


Since X is a continuous rv, .491 also equals P(X <5), the probability that 
headway time is (strictly) less than 5 s. The difference between these two events 
is {X = 5}, i.e., that headway time is exactly 5 s, which has probability zero: 
$P(X = 5) = \int_5^5 f(x)\,dx = 0$. 

This last statement may feel uncomfortable to you: Is there really zero chance 
that the headway time between two cars is exactly 5 s? If time is treated as 
continuous, then “exactly 5 s” means X = 5.000..., with an endless repetition of 
0s. That is to say, X isn’t rounded to the nearest second (or even tenth of a second); 
we are asking for the probability that X equals one specific number, 5.000..., out of 
the (uncountably) infinite collection of possible values of X. ■ 
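
The headway calculation, and the earlier interpretation of f(c) as probability per unit length, can be illustrated with a short sketch. It assumes the closed form P(X ≤ x) = 1 − e^{−.15(x − .5)} for x ≥ .5, obtained by integrating the Example 3.5 density from .5 to x; the function names and the choice c = 2 are illustrative only.

```python
from math import exp


def pdf_headway(x):
    """Density from Example 3.5: .15*exp(-.15*(x - .5)) for x >= .5."""
    return 0.15 * exp(-0.15 * (x - 0.5)) if x >= 0.5 else 0.0


def prob_at_most(x):
    """P(X <= x), from integrating the density between .5 and x."""
    return 1 - exp(-0.15 * (x - 0.5)) if x >= 0.5 else 0.0


p_at_most_5 = prob_at_most(5)

# Probability of a short window [c, c + dx] divided by its length
# should approach the density f(c) as dx shrinks.
c = 2.0
ratios = [(prob_at_most(c + dx) - prob_at_most(c)) / dx
          for dx in (1.0, 0.1, 0.001)]

print(round(p_at_most_5, 3))
print([round(r, 4) for r in ratios], round(pdf_headway(c), 4))
```

The ratios move toward f(2) as the window narrows, while the probability of the window itself shrinks toward zero, which is the sense in which a single point carries no probability.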


Unlike discrete distributions such as the binomial, hypergeometric, and negative 
binomial, the distribution of any given continuous rv cannot usually be derived 
using simple probabilistic arguments. Instead, one must make a judicious choice of 
pdf based on prior knowledge and available data. Fortunately, some general pdf 
families have been found to fit well in a wide variety of experimental situations; 
several of these are discussed later in the chapter. 

Just as in the discrete case, it is often helpful to think of the population of interest 
as consisting of X values rather than individuals or objects. The pdf is then a model 
for the distribution of values in this numerical population, and from this model 
various population characteristics (such as the mean) can be calculated. 

Several of the most important concepts introduced in the study of discrete 
distributions also play an important role for continuous distributions. Definitions 
analogous to those in Chap. 2 involve replacing summation by integration. 


3.1.2. The Cumulative Distribution Function 


The cumulative distribution function (cdf) F(x) for a discrete rv X gives, for any 
specified number x, the probability P(X ≤ x). It is obtained by summing the pmf 
p(y) over all possible values y satisfying y ≤ x. The cdf of a continuous rv gives the 
same probabilities P(X ≤ x) and is obtained by integrating the pdf f(y) between the 
limits −∞ and x. 
DEFINITION 
The cumulative distribution function F(x) for a continuous rv X is defined 
for every number x by 

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(y)\,dy$$


For each x, F(x) is the area under the density curve to the left of x. This is 
illustrated in Fig. 3.5, where F(x) increases smoothly as x increases. 


Fig. 3.5 A pdf and associated cdf 


Example 3.6 Let X, the thickness of a membrane, have a uniform distribution on 
[A, B]. The density function is shown in Fig. 3.6. 

For x < A, F(x) = 0, since there is no area under the graph of the density function 
to the left of such an x. For x ≥ B, F(x) = 1, since all the area is accumulated to the 
left of such an x. Finally, for A ≤ x < B, 

$$F(x) = \int_{-\infty}^{x} f(y)\,dy = \int_A^x \frac{1}{B-A}\,dy = \frac{1}{B-A} \cdot y\,\Big|_{y=A}^{y=x} = \frac{x-A}{B-A}$$

Fig. 3.6 The pdf for a uniform distribution; the shaded area is F(x) 
The entire cdf is 

$$F(x) = \begin{cases} 0 & x < A \\ \dfrac{x-A}{B-A} & A \le x < B \\ 1 & x \ge B \end{cases}$$

The graph of this cdf appears in Fig. 3.7. 

Fig. 3.7 The cdf for a uniform distribution ■ 


3.1.3. Using F(x) to Compute Probabilities 


The importance of the cdf here, just as for discrete rvs, is that probabilities of 
various intervals can be computed from a formula or table for F(x). 


PROPOSITION 
Let X be a continuous rv with pdf f(x) and cdf F(x). Then for any number a, 


P(X > a) = 1 − F(a) 

and for any two numbers a and b with a < b, 

P(a ≤ X ≤ b) = F(b) − F(a) 


Figure 3.8 illustrates the second part of this proposition; the desired probability 
is the shaded area under the density curve between a and b, and it equals the 
difference between the two shaded cumulative areas. This is different from what 
is appropriate for a discrete integer-valued rv (e.g., binomial or Poisson): 
P(a ≤ X ≤ b) = F(b) − F(a − 1) when a and b are integers. 


Fig. 3.8 Computing P(a ≤ X ≤ b) from cumulative probabilities 


Example 3.7 Suppose the pdf of the magnitude X of a dynamic load on a bridge 
(in newtons) is given by 

$$f(x) = \begin{cases} \dfrac{1}{8} + \dfrac{3}{8}x & 0 \le x \le 2 \\ 0 & \text{otherwise} \end{cases}$$


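For any number x between 0 and 2, 

$$F(x) = \int_{-\infty}^{x} f(y)\,dy = \int_0^x \left(\frac{1}{8} + \frac{3}{8}y\right)dy = \frac{x}{8} + \frac{3x^2}{16}$$

Thus 

$$F(x) = \begin{cases} 0 & x < 0 \\ \dfrac{x}{8} + \dfrac{3x^2}{16} & 0 \le x \le 2 \\ 1 & 2 < x \end{cases}$$

The graphs of f(x) and F(x) are shown in Fig. 3.9. The probability that the load is 
between 1 and 1.5 N is 

$$P(1 \le X \le 1.5) = F(1.5) - F(1) = \left[\frac{1}{8}(1.5) + \frac{3}{16}(1.5)^2\right] - \left[\frac{1}{8}(1) + \frac{3}{16}(1)^2\right] = \frac{19}{64} = .297$$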


Fig. 3.9 The pdf and cdf for Example 3.7 ■ 
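
As a check on Example 3.7, the sketch below evaluates interval probabilities directly from the cdf F(x) = x/8 + 3x²/16 using exact rational arithmetic. The function name `cdf_load` is a hypothetical helper, and `fractions.Fraction` is used only to avoid rounding.

```python
from fractions import Fraction


def cdf_load(x):
    """cdf from Example 3.7: F(x) = x/8 + 3x^2/16 on [0, 2]."""
    x = Fraction(x)
    if x < 0:
        return Fraction(0)
    if x > 2:
        return Fraction(1)
    return x / 8 + 3 * x**2 / 16


p_between = cdf_load(Fraction(3, 2)) - cdf_load(1)   # P(1 <= X <= 1.5)
p_above_1 = 1 - cdf_load(1)                          # P(X > 1)

print(p_between, float(p_between))
print(p_above_1, float(p_above_1))
```

No integration is repeated here: once F is available, each probability is a difference of two function values.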
The beauty of the cdf in the continuous case is that once it is available, any 
probability involving X can easily be calculated without any further integration. 


3.1.4 Obtaining f(x) from F(x) 


For X discrete, the pmf is obtained from the cdf by taking the difference between 
two F(x) values. The continuous analog of a difference is a derivative. The 
following result is a consequence of the Fundamental Theorem of Calculus. 


PROPOSITION 
If X is a continuous rv with pdf f(x) and cdf F(x), then at every x at which the 
derivative F′(x) exists, F′(x) = f(x). 


Example 3.8 (Example 3.6 continued) When X ~ Unif[A, B], F(x) is differentiable 
except at x = A and x = B, where the graph of F(x) has sharp corners. Since F(x) = 0 for 
x < A and F(x) = 1 for x > B, F′(x) = 0 = f(x) for such x. For A < x < B, 

$$F'(x) = \frac{d}{dx}\left(\frac{x-A}{B-A}\right) = \frac{1}{B-A} = f(x)$$ ■ 
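
The relationship F′(x) = f(x) for the uniform distribution can be illustrated numerically. In this sketch the endpoints A = 2 and B = 6 are arbitrary illustrative values, the function names are hypothetical, and a central difference quotient stands in for the derivative.

```python
def unif_cdf(x, a, b):
    """cdf of Unif[a, b]: 0 below a, (x - a)/(b - a) on [a, b), 1 from b on."""
    if x < a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def unif_pdf(x, a, b):
    """pdf of Unif[a, b]: constant 1/(b - a) on [a, b], 0 elsewhere."""
    return 1 / (b - a) if a <= x <= b else 0.0


def central_difference(F, x, h=1e-6):
    """Numerical approximation to F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)


A, B = 2.0, 6.0  # illustrative endpoints, not taken from the text
slopes = {x: central_difference(lambda t: unif_cdf(t, A, B), x)
          for x in (1.0, 3.0, 5.5, 7.0)}

for x, slope in slopes.items():
    print(x, round(slope, 6), unif_pdf(x, A, B))
```

The evaluation points deliberately avoid x = A and x = B, where the cdf has corners and the derivative does not exist.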


3.1.5 Percentiles of a Continuous Distribution 


When we say that an individual’s test score was at the 85th percentile of the 
population, we mean that 85% of all population scores were below that score and 
15% were above. Similarly, the 40th percentile is the score that exceeds 40% of all 
scores and is exceeded by 60% of all scores. 


DEFINITION 
Let p be a number between 0 and 1. The (100p)th percentile of the distribution 
of a continuous rv X, denoted by $\eta_p$, is defined implicitly by the equation 

$$p = F(\eta_p) = \int_{-\infty}^{\eta_p} f(y)\,dy \qquad (3.2)$$

Assuming we can find the inverse of F(x), this can also be written as 

$$\eta_p = F^{-1}(p)$$

In particular, the median of a continuous distribution is the 50th percentile, 
$\eta_{.5}$ or $F^{-1}(.5)$. That is, half the area under the density curve is to the left of 
the median and half is to the right of the median. We will occasionally denote 
the median of a distribution simply as η (i.e., without the .5 subscript). 
Fig. 3.10 The (100p)th percentile of a continuous distribution 


According to Expression (3.2), $\eta_p$ is that value on the measurement axis such that 
100p% of the area under the graph of f(x) lies to the left of $\eta_p$ and 100(1 − p)% lies 
to the right. Thus $\eta_{.75}$, the 75th percentile, is such that the area under the graph of 
f(x) to the left of $\eta_{.75}$ is .75. Figure 3.10 illustrates the definition. 


Example 3.9 The distribution of the amount of gravel (in tons) sold by a construc- 
tion supply company in a given week is a continuous rv X with pdf 


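$$f(x) = \begin{cases} \dfrac{3}{2}(1 - x^2) & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

The cdf of sales for any x between 0 and 1 is 

$$F(x) = \int_0^x \frac{3}{2}(1 - y^2)\,dy = \frac{3}{2}\left(y - \frac{y^3}{3}\right)\Big|_{y=0}^{y=x} = \frac{3}{2}\left(x - \frac{x^3}{3}\right)$$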


The graphs of both f(x) and F(x) appear in Fig. 3.11. The (100p)th percentile of 
this distribution satisfies the equation 


$$p = F(\eta_p) = \frac{3}{2}\left(\eta_p - \frac{\eta_p^3}{3}\right)$$

that is, 

$$\eta_p^3 - 3\eta_p + 2p = 0$$

For the median, p = .5 and the equation to be solved is η³ − 3η + 1 = 0; the 
solution is η = .347. If the distribution remains the same from week to week, then 
in the long run 50% of all weeks will result in sales of less than .347 tons and 50% in 
more than .347 tons. 
Fig. 3.11 The pdf and cdf for Example 3.9 ■ 
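
When the percentile equation has no convenient closed-form solution, as with the cubic in Example 3.9, it can be solved numerically. The sketch below uses bisection on F(η) = p, which works here because F is increasing on [0, 1]; the function names and the tolerance are illustrative choices, not part of the text.

```python
def cdf_gravel(x):
    """cdf from Example 3.9: F(x) = (3/2)(x - x^3/3) on [0, 1]."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    return 1.5 * (x - x**3 / 3)


def percentile(p, lo=0.0, hi=1.0, tol=1e-10):
    """Solve F(eta) = p by bisection on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf_gravel(mid) < p:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies at or to the left of mid
    return (lo + hi) / 2


median = percentile(0.5)
eta_75 = percentile(0.75)

print(round(median, 3), round(eta_75, 3))
```

The same routine handles any p between 0 and 1; only the cdf would need to change for a different distribution.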


A continuous distribution whose pdf is symmetric (which means that the graph 
of the pdf to the left of some point is a mirror image of the graph to the right of that 
point) has median η equal to the point of symmetry, since half the area under the 
curve lies to either side of this point. Figure 3.12 gives several examples. The 
amount of error in a measurement of a physical quantity is often assumed to have a 
symmetric distribution. 


Fig. 3.12 Medians of symmetric distributions 


3.1.6 Exercises: Section 3.1 (1-18) 


1. The current in a certain circuit as measured by an ammeter is a continuous 
random variable X with the following density function: 


c=. 3<x<5 


0 otherwise 


(a) Graph the pdf and verify that the total area under the density curve is 
indeed 1. 

(b) Calculate P(X ≤ 4). How does this probability compare to P(X < 4)? 

(c) Calculate P(3.5 ≤ X ≤ 4.5) and P(X > 4.5). 

2. Suppose the reaction temperature X (in °C) in a chemical process has a 

uniform distribution with A= —5 and B=5. 

(a) Compute P(X <0). 

(b) Compute P(—2.5 < X < 2.5). 

(c) Compute P(—2 < X <3). 

(d) For k satisfying −5 < k < k + 4 < 5, compute P(k < X < k + 4). Interpret 
this in words. 




3. Suppose the error involved in making a measurement is a continuous rv X with 
pdf 

$$f(x) = \begin{cases} .09375(4 - x^2) & -2 \le x \le 2 \\ 0 & \text{otherwise} \end{cases}$$


(a) Sketch the graph of f(x). 

(b) Compute P(X > 0). 

(c) Compute P(—1< X <1). 

(d) Compute P(X < —.5 or X>.5). 


4. Let X denote the vibratory stress (psi) on a wind turbine blade at a particular 
wind speed in a wind tunnel. The article “Blade Fatigue Life Assessment with 
Application to VAWTS” (J. Solar Energy Engr., 1982: 107-111) proposes the 
Rayleigh distribution, with pdf 

$$f(x; \theta) = \begin{cases} \dfrac{x}{\theta^2}\,e^{-x^2/(2\theta^2)} & x > 0 \\ 0 & \text{otherwise} \end{cases}$$

as a model for X, where θ is a positive constant. 

(a) Verify that f(x; θ) is a legitimate pdf. 

(b) Suppose θ = 100 (a value suggested by a graph in the article). What is the 
probability that X is at most 200? Less than 200? At least 200? 

(c) What is the probability that X is between 100 and 200 (again assuming 
θ = 100)? 

(d) Give an expression for the cdf of X. 


5. A college professor never finishes his lecture before the end of the hour and 
always finishes his lectures within 2 min after the hour. Let X = the time that 
elapses between the end of the hour and the end of the lecture and suppose the 
pdf of X is 


4. |e Dex e2 
i= { 0 otherwise 


(a) Find the value of k and draw the corresponding density curve. [Hint: Total 
area under the graph of f(x) is 1.] 

(b) What is the probability that the lecture ends within 1 min of the end of the 
hour? 

(c) What is the probability that the lecture continues beyond the hour for 
between 60 and 90 s? 

(d) What is the probability that the lecture continues for at least 90 s beyond 
the end of the hour? 


6. The actual tracking weight of a stereo cartridge that is set to track at 3 g on a 
particular changer can be regarded as a continuous rv X with pdf 

$$f(x) = \begin{cases} k\left[1 - (x - 3)^2\right] & 2 \le x \le 4 \\ 0 & \text{otherwise} \end{cases}$$

(a) Sketch the graph of f(x). 

(b) Find the value of k. 

(c) What is the probability that the actual tracking weight is greater than the 
prescribed weight? 

(d) What is the probability that the actual weight is within .25 g of the 
prescribed weight? 

(e) What is the probability that the actual weight differs from the prescribed 
weight by more than .5 g? 

7. The article “Second Moment Reliability Evaluation vs. Monte Carlo 
Simulations for Weld Fatigue Strength” (Quality and Reliability Engr. Intl., 
2012: 887-896) considered the use of a uniform distribution with A = .20 and 
B=4.25 for the diameter X of a certain type of weld (mm). 

(a) Determine the pdf of X and graph it. 

(b) What is the probability that diameter exceeds 3 mm? 

(c) What is the probability that diameter is within 1 mm of the mean 
diameter? 

(d) For any value a satisfying .20 < a < a + 1 < 4.25, what is P(a < X < a + 1)? 

8. Commuting to work requires getting on a bus near home and then transferring 
to a second bus. If the waiting time (in minutes) at each stop has a Unif[0, 5] 
distribution, then it can be shown that the total waiting time Y has the pdf 


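$$f(y) = \begin{cases} \dfrac{1}{25}\,y & 0 \le y < 5 \\ \dfrac{2}{5} - \dfrac{1}{25}\,y & 5 \le y \le 10 \\ 0 & y < 0 \text{ or } y > 10 \end{cases}$$


(a) Sketch the pdf of Y. 
(b) Verify that $\int_{-\infty}^{\infty} f(y)\,dy = 1$. 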
(c) What is the probability that total waiting time is at most 3 min? 
(d) What is the probability that total waiting time is at most 8 min? 
(e) What is the probability that total waiting time is between 3 and 8 min? 
(f) What is the probability that total waiting time is either less than 2 min or 
more than 6 min? 
9. Consider again the pdf of X = time headway given in Example 3.5. What is the 
probability that time headway is 
(a) At most 6 s? 
(b) More than 6 s? At least 6 s? 
(c) Between 5 and 6 s? 
10. A family of pdfs that has been used to approximate the distribution of income, 
city population size, and size of firms is the Pareto family. The family has two 
parameters, k and θ, both > 0, and the pdf is 

$$f(x; k, \theta) = \begin{cases} \dfrac{k \cdot \theta^k}{x^{k+1}} & x \ge \theta \\ 0 & x < \theta \end{cases}$$

(a) Sketch the graph of f(x; k, θ). 

(b) Verify that the total area under the graph equals 1. 

(c) If the rv X has pdf f(x; k, θ), obtain an expression for the cdf of X. 

(d) For θ < a < b, obtain an expression for the probability P(a ≤ X ≤ b). 

(e) Find an expression for the (100p)th percentile $\eta_p$. 

11. Let X denote the amount of time a book on 2-h reserve is actually checked out, 
and suppose the cdf is 

$$F(x) = \begin{cases} 0 & x < 0 \\ \dfrac{x^2}{4} & 0 \le x < 2 \\ 1 & 2 \le x \end{cases}$$

Use this to compute the following: 
(a) P(X ≤ 1) 
(b) P(.5 ≤ X ≤ 1) 
(c) P(X > .5) 
(d) The median checkout duration η [Hint: Solve F(η) = .5.] 
(e) F′(x) to obtain the density function f(x) 

12. The cdf for X = measurement error of Exercise 3 is 

$$F(x) = \begin{cases} 0 & x < -2 \\ \dfrac{1}{2} + \dfrac{3}{32}\left(4x - \dfrac{x^3}{3}\right) & -2 \le x < 2 \\ 1 & 2 \le x \end{cases}$$

(a) Compute P(X < 0). 

(b) Compute P(−1 < X < 1). 

(c) Compute P(X > .5). 

(d) Verify that f(x) is as given in Exercise 3 by obtaining F′(x). 

(e) Verify that η = 0. 

13. Example 3.5 introduced the concept of time headway in traffic flow and 
proposed a particular distribution for X = the headway between two randomly 
selected consecutive cars. Suppose that in a different traffic environment, the 
distribution of time headway has the form 


(a) Determine the value of k for which f(x) is a legitimate pdf. 

(b) Obtain the cumulative distribution function. 

(c) Use the cdf from (b) to determine the probability that headway exceeds 2 s 
and also the probability that headway is between 2 and 3 s. 

14. Let X denote the amount of space occupied by an article placed in a 1-ft³ 
packing container. The pdf of X is 

$$f(x) = \begin{cases} 90x^8(1 - x) & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}$$

(a) Graph the pdf. Then obtain the cdf of X and graph it. 

(b) What is P(X ≤ .5) [i.e., F(.5)]? 

(c) Using part (a), what is P(.25 < X ≤ .5)? What is P(.25 ≤ X ≤ .5)? 

(d) What is the 75th percentile of the distribution? 

15. Answer parts (a)-(d) of Exercise 14 for the random variable X, lecture time 
past the hour, given in Exercise 5. 

16. The article “A Model of Pedestrians’ Waiting Times for Street Crossings at 
Signalized Intersections” (Transportation Research, 2013: 17-28) suggested 
that under some circumstances the distribution of waiting time X could be 
modeled with the following pdf: 

$$f(x; \theta, \tau) = \begin{cases} \dfrac{\theta}{\tau}\left(1 - \dfrac{x}{\tau}\right)^{\theta - 1} & 0 \le x < \tau \\ 0 & \text{otherwise} \end{cases}$$

where θ, τ > 0. 

(a) Graph f(x; θ, 80) for the three cases θ = 4, 1, and .5 (these graphs appear in 
the cited article) and comment on their shapes. 

(b) Obtain the cumulative distribution function of X. 

(c) Obtain an expression for the median of the waiting time distribution. 

(d) For the case θ = 4 and τ = 80, calculate P(50 ≤ X ≤ 70) without doing any 
additional integration. 

17. Let X be a continuous rv with cdf 

[This type of cdf is suggested in the article “Variability in Measured 
Bedload-Transport Rates” (Water Resources Bull., 1985: 39-48) as a model 
for a hydrologic variable.] What is 
(a) P(X ≤ 1)? 
(b) P(1 ≤ X ≤ 3)? 
(c) The pdf of X? 


18. Let X be the temperature in °C at which a chemical reaction takes place, and 
let Y be the temperature in °F (so Y = 1.8X + 32). 

(a) If the median of the X distribution is η, show that 1.8η + 32 is the median of 
the Y distribution. 

(b) How is the 90th percentile of the Y distribution related to the 90th 
percentile of the X distribution? Verify your conjecture. 

(c) More generally, if Y = aX + b, how is any particular percentile of the 
Y distribution related to the corresponding percentile of the X distribution? 


3.2 Expected Values and Moment Generating Functions 


In Sect. 3.1 we saw that the transition from a discrete cdf to a continuous cdf entails 
replacing summation by integration. The same thing is true in moving from 
expected values of discrete variables to those of continuous variables. 


3.2.1 Expected Values 


For a discrete random variable X, the mean $\mu_X$ or E(X) was defined as a weighted 
average and obtained by summing x · p(x) over possible X values. Here we replace 
summation by integration and the pmf by the pdf to get a continuous weighted 
average. 


DEFINITION 
The expected value or mean value of a continuous rv X with pdf f(x) is 


$$\mu = \mu_X = E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$$


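Example 3.10 (Example 3.9 continued) The pdf of weekly gravel sales X was 

$$f(x) = \begin{cases} \dfrac{3}{2}(1 - x^2) & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

so 

$$E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\,dx = \int_0^1 x \cdot \frac{3}{2}(1 - x^2)\,dx = \frac{3}{2}\int_0^1 (x - x^3)\,dx = \frac{3}{2}\left(\frac{x^2}{2} - \frac{x^4}{4}\right)\Big|_{x=0}^{x=1} = \frac{3}{8}$$

If gravel sales are determined week after week according to the given pdf, then 
the long-run average value of sales per week will be .375 ton. ■ 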
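
The value E(X) = 3/8 can be confirmed numerically. This is a minimal sketch, assuming a midpoint rule as the integration method; `pdf_gravel` and `expected_value` are hypothetical names introduced for illustration.

```python
def pdf_gravel(x):
    """Density of weekly gravel sales: (3/2)(1 - x^2) on [0, 1]."""
    return 1.5 * (1 - x**2) if 0 <= x <= 1 else 0.0


def expected_value(pdf, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of x * pdf(x) over [a, b]."""
    width = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * width
        total += x * pdf(x)
    return total * width


mean_sales = expected_value(pdf_gravel, 0, 1)
print(round(mean_sales, 4))
```

The limits 0 and 1 are passed explicitly because the density is zero outside that interval, so nothing is lost by not integrating over the whole real line.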


Similar to the interpretation in the discrete case, the mean value μ can be 
regarded as the balance point (or fulcrum or center of mass) of a continuous 
distribution. In Example 3.10, if a piece of cardboard were cut out in the shape 
of the region under the density curve f(x), then it would balance if supported at 
μ = 3/8 along the bottom edge. When a pdf f(x) is symmetric, then it will balance at 
its point of symmetry, which must be the mean μ. Recall from Sect. 3.1 that the 
median is also the point of symmetry; in general, if a distribution is symmetric and 
the mean exists, then it is equal to the median. 

Often we wish to compute the expected value of some function h(X) of the rv X. 
If we think of h(X) as a new rv Y, methods from Sect. 3.7 can be used to derive the 
pdf of Y, and E(Y) can be computed from the definition. Fortunately, as in the 
discrete case, there is an easier way to compute E[h(X)]. 


PROPOSITION 
If X is a continuous rv with pdf f(x) and h(X) is any function of X, then 


$$\mu_{h(X)} = E[h(X)] = \int_{-\infty}^{\infty} h(x) \cdot f(x)\,dx$$


This is sometimes called the Law of the Unconscious Statistician. 


Importantly, except in the cases where h(x) is a linear function (see later in this 
section), E[h(X)] is not equal to $h(\mu_X)$, the function h evaluated at the mean of X. 


Example 3.11 The variation in a certain electrical current source X (in milliamps)
can be modeled by the pdf

\[ f(x) = \begin{cases} 1.25 - .25x & 2 \le x \le 4 \\ 0 & \text{otherwise} \end{cases} \]

The average current from this source is

\[ E(X) = \int_2^4 x(1.25 - .25x)\,dx = \frac{17}{6} = 2.833 \text{ mA} \]

If this current passes through a 220-Ω resistor, the resulting power
(in microwatts) is given by the expression h(X) = (current)²(resistance) = 220X².
The expected power is given by

\[ E(h(X)) = E(220X^2) = \int_2^4 220x^2(1.25 - .25x)\,dx = \frac{5500}{3} = 1833.3\ \mu\text{W} \]

Notice that the expected power is not equal to 220(2.833)², a common error that
results from substituting the mean current μ_X into the power formula.
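
The gap between E[h(X)] and h(E(X)) in Example 3.11 can be checked numerically. The following is a minimal sketch in Python (used here only as a stand-in for the Matlab and R commands that appear elsewhere in this chapter); the helper name integrate and the choice of a midpoint rule are our own assumptions, not part of the example.

```python
def integrate(g, a, b, n=10_000):
    """Composite midpoint rule for the integral of g over [a, b] (illustrative helper)."""
    width = (b - a) / n
    return width * sum(g(a + (i + 0.5) * width) for i in range(n))

def pdf(x):
    # pdf of the current X (in mA) from Example 3.11, supported on [2, 4]
    return 1.25 - 0.25 * x

def power(x):
    # h(x) = 220 x^2: power in microwatts through a 220-ohm resistor
    return 220 * x ** 2

mean_current = integrate(lambda x: x * pdf(x), 2, 4)             # E(X)
expected_power = integrate(lambda x: power(x) * pdf(x), 2, 4)    # E[h(X)]
naive_power = power(mean_current)                                # h(E(X)), the common error

print(round(mean_current, 3), round(expected_power, 1), round(naive_power, 1))
```

The naive value h(E(X)) comes out noticeably smaller than E[h(X)], which is the point of the warning above.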


Example 3.12 Two species are competing in a region for control of a limited
amount of a resource. Let X = the proportion of the resource controlled by species
1 and suppose X has pdf

\[ f(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]

which is a uniform distribution on [0, 1]. (In her book Ecological Diversity, E. C.
Pielou calls this the “broken-stick” model for resource allocation, since it is
analogous to breaking a stick at a randomly chosen point.) Then the species that
controls the majority of this resource controls the amount

\[ h(X) = \max(X, 1 - X) = \begin{cases} 1 - X & \text{if } 0 \le X < \frac{1}{2} \\ X & \text{if } \frac{1}{2} \le X \le 1 \end{cases} \]

The expected amount controlled by the species having majority control is then

\[ E[h(X)] = \int_{-\infty}^{\infty} \max(x, 1 - x) \cdot f(x)\,dx = \int_0^1 \max(x, 1 - x) \cdot 1\,dx = \int_0^{1/2} (1 - x)\,dx + \int_{1/2}^{1} x\,dx = \frac{3}{4} \]
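
The broken-stick expectation can also be approximated by simulation: splitting the integral of max(x, 1 − x) over [0, 1] at x = 1/2 gives 3/8 + 3/8 = 3/4, and a long run of simulated sticks should average close to that value. This is a rough sketch in Python; the function name, the sample size, and the seed are arbitrary choices made for illustration.

```python
import random

def simulate_majority_share(num_sticks=100_000, seed=2024):
    """Average of max(X, 1 - X) over simulated X ~ Unif[0, 1) (illustrative helper)."""
    rng = random.Random(seed)          # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(num_sticks):
        x = rng.random()               # share of the resource held by species 1
        total += max(x, 1 - x)         # share held by whichever species has the majority
    return total / num_sticks

estimate = simulate_majority_share()
print(round(estimate, 3))              # should be near the exact value 3/4
```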


In the discrete case, the variance of X was defined as the expected squared
deviation from μ and was calculated by summation. Here again integration replaces
summation.


DEFINITION
The variance of a continuous random variable X with pdf f(x) and mean value
μ is

\[ \sigma_X^2 = \mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x)\,dx = E[(X - \mu)^2] \]

The standard deviation of X is $\sigma_X = \mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)}$.

As in the discrete case, $\sigma_X^2$ is the expected or average squared deviation about
the mean μ, and $\sigma_X$ can be interpreted roughly as the size of a representative
deviation from the mean value μ. Note that $\sigma_X$ has the same units as X itself.


Example 3.13 Let X ~ Unif[A, B]. Since a uniform distribution is symmetric, the
mean of X is at the density curve’s point of symmetry, which is clearly the midpoint
(A + B)/2. This can be verified by integration:

\[ \mu = \int_A^B x \cdot \frac{1}{B - A}\,dx = \frac{1}{B - A} \cdot \frac{x^2}{2}\Bigg|_A^B = \frac{1}{B - A} \cdot \frac{B^2 - A^2}{2} = \frac{A + B}{2} \]

The variance of X is then given by

\[ \sigma^2 = \int_A^B (x - \mu)^2 \cdot \frac{1}{B - A}\,dx = \frac{1}{B - A}\int_A^B \left(x - \frac{A + B}{2}\right)^2 dx \]
\[ = \frac{1}{B - A}\int_{-(B-A)/2}^{(B-A)/2} u^2\,du \qquad \text{substitute } u = x - \frac{A + B}{2} \]
\[ = \frac{2}{B - A}\int_0^{(B-A)/2} u^2\,du \qquad \text{symmetry} \]
\[ = \frac{2}{B - A} \cdot \frac{u^3}{3}\Bigg|_0^{(B-A)/2} = \frac{2}{B - A} \cdot \frac{(B - A)^3}{2^3 \cdot 3} = \frac{(B - A)^2}{12} \]

The standard deviation of X is the square root of the variance: $\sigma = (B - A)/\sqrt{12}$.
Notice that the standard deviation of a Unif[A, B] distribution is proportional to the
length of the interval, B − A, which matches our intuitive notion that a larger
standard deviation corresponds to greater “spread” in a distribution.


Section 2.3 presented several properties of expected value, variance, and 
standard deviation for discrete random variables. Those same properties hold for 
the continuous case; proofs of these results are obtained by replacing summation 
with integration in the proofs presented in Chap. 2. 


PROPOSITION
Let X be a continuous rv with pdf f(x), mean μ, and standard deviation σ.
Then the following properties hold.

1. (variance shortcut)
\[ \mathrm{Var}(X) = E(X^2) - \mu^2 = \int_{-\infty}^{\infty} x^2 \cdot f(x)\,dx - \left(\int_{-\infty}^{\infty} x \cdot f(x)\,dx\right)^2 \]

2. (Chebyshev’s inequality) For any constant k ≥ 1,
\[ P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2} \]

3. (linearity of expectation) For any functions $h_1(X)$ and $h_2(X)$ and any
constants $a_1$, $a_2$, and b,
\[ E[a_1 h_1(X) + a_2 h_2(X) + b] = a_1 E[h_1(X)] + a_2 E[h_2(X)] + b \]

4. (rescaling) For any constants a and b,
\[ E(aX + b) = a\mu + b \qquad \mathrm{Var}(aX + b) = a^2\sigma^2 \qquad \sigma_{aX+b} = |a|\sigma \]


Example 3.14 (Example 3.10 continued) For X = weekly gravel sales, we computed
E(X) = 3/8. Since

\[ E(X^2) = \int_{-\infty}^{\infty} x^2 \cdot f(x)\,dx = \int_0^1 x^2 \cdot \frac{3}{2}(1 - x^2)\,dx = \frac{3}{2}\int_0^1 (x^2 - x^4)\,dx = \frac{1}{5}, \]

\[ \mathrm{Var}(X) = \frac{1}{5} - \left(\frac{3}{8}\right)^2 = \frac{19}{320} = .059 \quad \text{and} \quad \sigma_X = .244 \]

Suppose the amount of gravel actually received by customers in a week is
h(X) = X − .02X²; the second term accounts for the small amount that is lost in
transport. Then the average weekly amount received by customers is

\[ E(X - .02X^2) = E(X) - .02E(X^2) = \frac{3}{8} - .02 \cdot \frac{1}{5} = .371 \text{ tons} \]
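
Because the gravel pdf is a polynomial, the variance shortcut and linearity of expectation in Example 3.14 can be reproduced with exact rational arithmetic. The sketch below is in Python and makes its own assumptions: polynomials are stored as coefficient lists, and the helpers integrate_poly and times_x are hypothetical names introduced only for this illustration.

```python
from fractions import Fraction

def integrate_poly(coeffs, a, b):
    """Exact integral over [a, b] of sum(coeffs[k] * x**k) (illustrative helper)."""
    a, b = Fraction(a), Fraction(b)
    return sum(c * (b ** (k + 1) - a ** (k + 1)) / (k + 1)
               for k, c in enumerate(coeffs))

def times_x(coeffs, power=1):
    """Multiply a polynomial by x**power by shifting its coefficient list."""
    return [Fraction(0)] * power + list(coeffs)

# f(x) = (3/2)(1 - x^2) on [0, 1], stored as the coefficients of 1, x, x^2
pdf = [Fraction(3, 2), Fraction(0), Fraction(-3, 2)]

mean = integrate_poly(times_x(pdf, 1), 0, 1)             # E(X)
second_moment = integrate_poly(times_x(pdf, 2), 0, 1)    # E(X^2)
variance = second_moment - mean ** 2                     # variance shortcut
received = mean - Fraction(2, 100) * second_moment       # E(X - .02 X^2) by linearity

print(mean, second_moment, variance, float(received))    # 3/8 1/5 19/320 0.371
```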


Example 3.15 When a dart is thrown at a circular target, consider the location
of the landing point relative to the bull’s eye. Let X be the angle in degrees
measured from the horizontal, and assume that X ~ Unif[0, 360). By Example
3.13, E(X) = 180 and $\mathrm{SD}(X) = 360/\sqrt{12}$. Define Y to be the angle measured in
radians between −π and π, so Y = (2π/360)X − π. Then, applying the rescaling
properties with a = 2π/360 and b = −π,

\[ E(Y) = \frac{2\pi}{360} \cdot E(X) - \pi = \frac{2\pi}{360}(180) - \pi = 0 \]

and

\[ \sigma_Y = \left|\frac{2\pi}{360}\right| \cdot \sigma_X = \frac{2\pi}{360} \cdot \frac{360}{\sqrt{12}} = \frac{2\pi}{\sqrt{12}} \]

3.2.2 Moment Generating Functions


Moments and moment generating functions for discrete random variables were 
introduced in Sect. 2.7. These concepts carry over to the continuous case. 


DEFINITION
The moment generating function (mgf) of a continuous random variable
X is

\[ M_X(t) = E(e^{tX}) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx. \]

As in the discrete case, the moment generating function exists iff $M_X(t)$
is defined for an interval that includes zero as well as positive and negative
values of t.


Just as before, when t = 0 the value of the mgf is always 1:

\[ M_X(0) = E(e^{0 \cdot X}) = \int_{-\infty}^{\infty} e^{0x} f(x)\,dx = \int_{-\infty}^{\infty} f(x)\,dx = 1. \]


Example 3.16 At a store, the checkout time X in minutes has the pdf $f(x) = 2e^{-2x}$,
x > 0; f(x) = 0 otherwise. Then

\[ M_X(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx = \int_0^{\infty} e^{tx} \cdot 2e^{-2x}\,dx = \int_0^{\infty} 2e^{-(2-t)x}\,dx \]
\[ = -\frac{2}{2 - t} e^{-(2-t)x}\Bigg|_0^{\infty} = \frac{2}{2 - t} - \frac{2}{2 - t} \lim_{x \to \infty} e^{-(2-t)x} \]

The limit above exists (in fact, it equals zero) provided the coefficient on
x is negative, i.e., −(2 − t) < 0. This is equivalent to t < 2. The mgf exists because
it is defined for an interval of values including 0 in its interior, specifically (−∞, 2).
For t in that interval, the mgf of X is $M_X(t) = 2/(2 - t)$.

Notice that $M_X(0) = 2/(2 - 0) = 1$. Of course, from the calculation preceding this
example we know that $M_X(0) = 1$ must always be the case, but it is useful as a check
to set t = 0 and see if the result is 1.


Recall that in Sect. 2.7 we had a uniqueness property for the mgfs of discrete
distributions. This proposition is equally valid in the continuous case: two
distributions have the same pdf if and only if they have the same moment
generating function, assuming that the mgf exists. For example, if a random
variable X is known to have mgf $M_X(t) = 2/(2 - t)$ for t < 2, then from Example
3.16 it must necessarily be the case that the pdf of X is $f(x) = 2e^{-2x}$ for x > 0 and
f(x) = 0 otherwise.

In the discrete case we also had a theorem on how to get moments from the mgf,
and this theorem applies also in the continuous case: the rth moment of a continuous
rv with mgf $M_X(t)$ is given by

\[ E(X^r) = M_X^{(r)}(0), \]

the rth derivative of the mgf with respect to t evaluated at t = 0, if the mgf exists.


Example 3.17 (Example 3.16 continued) The mgf of the rv X = checkout time at
the store was found to be $M_X(t) = 2/(2 - t) = 2(2 - t)^{-1}$ for t < 2. To find the mean
and standard deviation, first compute the derivatives:

\[ M_X'(t) = -2(2 - t)^{-2}(-1) = \frac{2}{(2 - t)^2} \]

\[ M_X''(t) = \frac{d}{dt}\left[2(2 - t)^{-2}\right] = -4(2 - t)^{-3}(-1) = \frac{4}{(2 - t)^3} \]

Setting t to 0 in the first derivative gives the expected checkout time as

\[ E(X) = M_X^{(1)}(0) = M_X'(0) = .5 \text{ min.} \]

Setting t to 0 in the second derivative gives the second moment

\[ E(X^2) = M_X^{(2)}(0) = M_X''(0) = .5, \]

from which the variance of the checkout time is $\mathrm{Var}(X) = \sigma^2 = E(X^2) - [E(X)]^2 =
.5 - .5^2 = .25$ and the standard deviation is $\sigma = \sqrt{.25} = .5$ min.
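
The moments in Example 3.17 can be double-checked without symbolic differentiation by approximating the derivatives of the mgf at t = 0 with central finite differences. This is only a numerical sketch in Python; the step size h and the helper names are our own assumptions.

```python
def mgf(t):
    # M_X(t) = 2 / (2 - t), valid for t < 2 (Example 3.16)
    return 2 / (2 - t)

def first_derivative_at_zero(m, h=1e-4):
    """Central-difference approximation to M'(0)."""
    return (m(h) - m(-h)) / (2 * h)

def second_derivative_at_zero(m, h=1e-4):
    """Central-difference approximation to M''(0)."""
    return (m(h) - 2 * m(0) + m(-h)) / h ** 2

mean = first_derivative_at_zero(mgf)              # approximates E(X)
second_moment = second_derivative_at_zero(mgf)    # approximates E(X^2)
variance = second_moment - mean ** 2

print(round(mean, 4), round(second_moment, 4), round(variance, 4))
```

The same two helpers work for any mgf that is defined on an interval around 0, which is exactly the existence condition stated in the definition.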


We will sometimes need to transform X using a linear function Y = aX + b.
As discussed in the discrete case, if X has the mgf $M_X(t)$ and Y = aX + b, then
$M_Y(t) = e^{bt}M_X(at)$.


Example 3.18 Let X ~ Unif[A, B]. As verified in Exercise 32, the moment
generating function of X is

\[ M_X(t) = \begin{cases} \dfrac{e^{Bt} - e^{At}}{(B - A)t} & t \ne 0 \\ 1 & t = 0 \end{cases} \]

In particular, consider the situation in Example 3.15. Let X, the angle measured
in degrees, be uniform on [0, 360], so A = 0 and B = 360. Then

\[ M_X(t) = \frac{e^{360t} - 1}{360t} \quad t \ne 0, \qquad M_X(0) = 1 \]

Now let Y = (2π/360)X − π, so Y is the angle measured in radians between −π
and π. Using the mgf rule for linear transformations with a = 2π/360 and b = −π,
we get

\[ M_Y(t) = e^{bt}M_X(at) = e^{-\pi t} M_X\left(\frac{2\pi}{360}t\right) = e^{-\pi t} \cdot \frac{e^{360(2\pi/360)t} - 1}{360\left(\frac{2\pi}{360}\right)t} = \frac{e^{\pi t} - e^{-\pi t}}{2\pi t} \quad t \ne 0, \qquad M_Y(0) = 1 \]

This matches the general form of the moment generating function for a uniform
random variable with A = −π and B = π. Thus, by the mgf uniqueness property,
Y ~ Unif[−π, π].
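
The identity used in Example 3.18 can be spot-checked numerically: evaluating the transformed mgf at a few values of t and comparing it with the Unif[−π, π] mgf should give matching numbers. The Python sketch below does this; the function names and the particular t values are arbitrary choices for illustration.

```python
import math

def uniform_mgf(t, a, b):
    """mgf of Unif[a, b]; by definition it equals 1 at t = 0."""
    if t == 0:
        return 1.0
    return (math.exp(b * t) - math.exp(a * t)) / ((b - a) * t)

def mgf_of_y(t):
    # Y = aX + b with a = 2*pi/360, b = -pi, and X ~ Unif[0, 360],
    # so M_Y(t) = e^{bt} * M_X(at)
    a, b = 2 * math.pi / 360, -math.pi
    return math.exp(b * t) * uniform_mgf(a * t, 0, 360)

for t in (-1.0, -0.25, 0.0, 0.5, 2.0):
    # the two columns should agree up to floating-point rounding
    print(t, mgf_of_y(t), uniform_mgf(t, -math.pi, math.pi))
```

Agreement at a handful of points is of course not a proof; the algebra in the example is what establishes the result.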


3.2.3. Exercises: Section 3.2 (19-38) 


19. Reconsider the distribution of checkout duration X described in Exercise 11.
Compute the following:
(a) E(X)
(b) Var(X) and SD(X)
(c) If the borrower is charged an amount h(X) = X² when checkout duration
is X, compute the expected charge E[h(X)].

20. The article “Modeling Sediment and Water Column Interactions for
Hydrophobic Pollutants” (Water Res., 1984: 1169-1174) suggests the uniform
distribution on the interval [7.5, 20] as a model for depth (cm) of the
bioturbation layer in sediment in a certain region.
(a) What are the mean and variance of depth?
(b) What is the cdf of depth?
(c) What is the probability that observed depth is at most 10? Between 10 and
15?
(d) What is the probability that the observed depth is within 1 standard
deviation of the mean value? Within 2 standard deviations?

21. For the distribution of Exercise 14,
(a) Compute E(X) and SD(X).
(b) What is the probability that X is more than 1 standard deviation from its
mean value?

22. Consider the pdf given in Exercise 6.
(a) Obtain and graph the cdf of X.
(b) From the graph of f(x), what is the median, η?
(c) Compute E(X) and Var(X).

23. Let X ~ Unif[A, B].
(a) Obtain an expression for the (100p)th percentile.
(b) Obtain an expression for the median, η. How does this compare to the
mean μ, and why does that make sense for this distribution?
(c) For n a positive integer, compute $E(X^n)$.

24. Consider the pdf for total waiting time Y for two buses

\[ f(y) = \begin{cases} \frac{1}{25}y & 0 \le y < 5 \\ \frac{2}{5} - \frac{1}{25}y & 5 \le y \le 10 \\ 0 & \text{otherwise} \end{cases} \]

introduced in Exercise 8.
(a) Compute and sketch the cdf of Y. [Hint: Consider separately 0 ≤ y < 5
and 5 ≤ y ≤ 10 in computing F(y). A graph of the pdf should be
helpful.]
(b) Obtain an expression for the (100p)th percentile. [Hint: Consider
separately 0 < p < .5 and .5 ≤ p < 1.]
(c) Compute E(Y) and Var(Y). How do these compare with the expected
waiting time and variance for a single bus when the time is uniformly
distributed on [0, 5]?
(d) Explain how symmetry can be used to obtain E(Y).

25. An ecologist wishes to mark off a circular sampling region having radius
10 m. However, the radius of the resulting region is actually a random variable
R with pdf

\[ f(r) = \begin{cases} \frac{3}{4}\left[1 - (10 - r)^2\right] & 9 \le r \le 11 \\ 0 & \text{otherwise} \end{cases} \]

What is the expected area of the resulting circular region?

26. The weekly demand for propane gas (in 1000s of gallons) from a particular
facility is an rv X with pdf

\[ f(x) = \begin{cases} 2\left(1 - \frac{1}{x^2}\right) & 1 \le x \le 2 \\ 0 & \text{otherwise} \end{cases} \]

(a) Compute the cdf of X.
(b) Obtain an expression for the (100p)th percentile. What is the value of the
median, η?
(c) Compute E(X). How do the mean and median of this distribution
compare?
(d) Compute Var(X) and SD(X).
(e) If 1.5 thousand gallons are in stock at the beginning of the week and no
new supply is due in during the week, how much of the 1.5 thousand
gallons is expected to be left at the end of the week? [Hint: Let h(x) =
amount left when demand is x.]

27. If the temperature at which a compound melts is a random variable with mean
value 120°C and standard deviation 2°C, what are the mean temperature and
standard deviation measured in °F? [Hint: °F = 1.8°C + 32.]

28. Let X have the Pareto pdf introduced in Exercise 10:

\[ f(x; k, \theta) = \begin{cases} \dfrac{k\theta^k}{x^{k+1}} & x \ge \theta \\ 0 & x < \theta \end{cases} \]

(a) If k > 1, compute E(X).
(b) What can you say about E(X) if k = 1?
(c) If k > 2, show that $\mathrm{Var}(X) = k\theta^2(k - 1)^{-2}(k - 2)^{-1}$.
(d) If k = 2, what can you say about Var(X)?
(e) What conditions on k are necessary to ensure that $E(X^n)$ is finite?

29. The time (min) between successive visits to a particular Web site has pdf
$f(x) = 4e^{-4x}$, x ≥ 0; f(x) = 0 otherwise. Use integration by parts to obtain E(X)
and SD(X).

30. Consider the weights, in grams, of walnuts harvested at a nearby farm.
Suppose this weight distribution can be modeled by the following pdf:

\[ f(x) = \begin{cases} .5 - \dfrac{x}{8} & 0 \le x \le 4 \\ 0 & \text{otherwise} \end{cases} \]

(a) Show that E(X) = 4/3 and Var(X) = 8/9.
(b) The skewness coefficient is defined as $E[(X - \mu)^3]/\sigma^3$. Show that its value
for the given pdf is .566. What would the skewness be for a perfectly
symmetric pdf?

31. The delta method provides approximations to the mean and variance of
a nonlinear function h(X) of a rv X. These approximations are based on a
first-order Taylor series expansion of h(x) about x = μ, the mean of X:

\[ h(X) \approx h_1(X) = h(\mu) + h'(\mu)(X - \mu) \]

(a) Show that $E[h_1(X)] = h(\mu)$. (This is the delta method approximation to
E[h(X)].)
(b) Show that $\mathrm{Var}[h_1(X)] = [h'(\mu)]^2\,\mathrm{Var}(X)$. (This is the delta method
approximation to Var[h(X)].)
(c) If the voltage v across a medium is fixed but current I is random, then
resistance will also be a random variable related to I by R = v/I. If $\mu_I = 20$
and $\sigma_I = .5$, calculate approximations to $\mu_R$ and $\sigma_R$.
(d) Let R have the distribution in Exercise 25, whose mean and variance
are 10 and 1/5, respectively. Let h(R) = πR², the area of the ecologist’s
sampling region. How does E[h(R)] from Exercise 25 compare to the delta
method approximation h(10)?
(e) It can be shown that Var[h(R)] = 14008π²/175. Compute the delta method
approximation to Var[h(R)] using the formula in (b). How good is the
approximation?

32. Let X ~ Unif[A, B], so its pdf is f(x) = 1/(B − A), A ≤ x ≤ B, f(x) = 0 otherwise.
Show that the moment generating function of X is

\[ M_X(t) = \begin{cases} \dfrac{e^{Bt} - e^{At}}{(B - A)t} & t \ne 0 \\ 1 & t = 0 \end{cases} \]

33. Let X ~ Unif[0, 1]. Find a linear function Y = g(X) such that the interval
[0, 1] is transformed into [−5, 5]. Use the relationship for linear functions
$M_{aX+b}(t) = e^{bt}M_X(at)$ to obtain the mgf of Y from the mgf of X. Compare your
answer with the result of Exercise 32, and use this to obtain the pdf of Y.

34. If the pdf of a measurement error X is $f(x) = .5e^{-|x|}$, −∞ < x < ∞, show that
$M_X(t) = 1/(1 - t^2)$ for |t| < 1.

35. Consider the rv X = time headway in Example 3.5.
(a) Find the moment generating function and use it to find the mean and
variance.
(b) Now consider a random variable whose pdf is

\[ f(x) = \begin{cases} .15e^{-.15x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases} \]

Find the moment generating function and use it to find the mean and
variance. Compare with (a), and explain the similarities and differences.
(c) Let Y = X − .5 and use the relationship for linear functions $M_{aX+b}(t) =
e^{bt}M_X(at)$ to obtain the mgf of Y from (a). Compare with the result of
(b) and explain.

36. Define $L_X(t) = \ln[M_X(t)]$. It was shown in Exercise 120 of Chap. 2 that
$L_X'(0) = E(X)$ and $L_X''(0) = \mathrm{Var}(X)$.
(a) Determine $M_X(t)$ for the pdf in Exercise 29, and use this mgf to obtain
E(X) and Var(X). How does this compare, in terms of difficulty, with the
integration by parts required in that exercise?
(b) Determine $L_X(t)$ for this same distribution, and use $L_X(t)$ to obtain E(X)
and Var(X). How does the computational effort here compare with that of
(a)?

37. Let X be a nonnegative, continuous rv with pdf f(x) and cdf F(x).
(a) Show that, for any constant t > 0,


\[ \int_t^{\infty} x \cdot f(x)\,dx \ge t \cdot P(X > t) = t \cdot [1 - F(t)] \]

(b) Assume the mean of X is finite (i.e., the integral defining μ converges).
Use part (a) to show that

\[ \lim_{t \to \infty} t \cdot [1 - F(t)] = 0 \]

[Hint: Write the integral for μ as the sum of two other integrals, one from
0 to t and another from t to ∞.]

38. Let X be a nonnegative, continuous rv with cdf F(x).
(a) Assuming the mean μ of X is finite, show that

\[ \mu = \int_0^{\infty} [1 - F(x)]\,dx \]

[Hint: Apply integration by parts to the integral above, and use the result
of the previous exercise.] This is the continuous analog of the result
established in Exercise 48 of Chap. 2.
(b) A similar argument can be used to show that the kth moment of X is given
by

\[ E(X^k) = k\int_0^{\infty} x^{k-1}[1 - F(x)]\,dx \]

and that $E(X^k)$ exists iff $t^k[1 - F(t)] \to 0$ as $t \to \infty$. (This was the
topic of a 2012 article in The American Statistician.) Suppose the
lifetime X, in weeks, of a low-grade transistor under continuous use has
cdf $F(x) = 1 - (x + 1)^{-3}$ for x > 0. Without finding the pdf of X, determine
its mean and its standard deviation.


3.3 The Normal (Gaussian) Distribution 


The normal distribution, often called the Gaussian distribution by engineers, is the 
most important one in all of probability and statistics. Many numerical populations 
have distributions that can be fit very closely by an appropriate normal curve. 
Examples include heights, weights, and other physical characteristics, measurement
errors in scientific experiments, measurements on fossils, reaction times in
psychological experiments, measurements of intelligence and aptitude, scores on 
various tests, and numerous economic measures and indicators. Even when the 
underlying distribution is discrete, the normal curve often gives an excellent 
approximation. In addition, even when individual variables themselves are not 
normally distributed, sums and averages of the variables will, under suitable 
conditions, have approximately a normal distribution; this is the content of the 
Central Limit Theorem discussed in Chap. 4.

Fig. 3.13 Normal density curves


DEFINITION

A continuous rv X is said to have a normal distribution (or Gaussian
distribution) with parameters μ and σ, where −∞ < μ < ∞ and σ > 0, if
the pdf of X is

\[ f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2/(2\sigma^2)} \qquad -\infty < x < \infty \tag{3.3} \]

The statement that X is normally distributed with parameters μ and σ is
often abbreviated X ~ N(μ, σ).


Figure 3.13 presents graphs of f(x; μ, σ) for several different (μ, σ) pairs. Each
resulting density curve is symmetric about μ and bell-shaped, so the center of the
bell (point of symmetry) is both the mean of the distribution and the median. The
value of σ is the distance from μ to the inflection points of the curve (the points at
which the curve changes from turning downward to turning upward). Large
values of σ yield density curves that are quite spread out about μ, whereas small
values of σ yield density curves with a high peak above μ and most of the area
under the density curve quite close to μ. Thus a large σ implies that a value of
X far from μ may well be observed, whereas such a value is quite unlikely when σ
is small.

Clearly f(x; μ, σ) ≥ 0, but a somewhat complicated calculus argument is
required to prove that $\int_{-\infty}^{\infty} f(x; \mu, \sigma)\,dx = 1$ (see Exercise 66). It can be shown
using calculus (Exercise 67) or moment generating functions (Exercise 68) that
E(X) = μ and Var(X) = σ², so the parameters μ and σ are the mean and the standard
deviation, respectively, of X.


3.3.1 The Standard Normal Distribution 


To compute P(a ≤ X ≤ b) when X ~ N(μ, σ), we must evaluate

\[ \int_a^b \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2/(2\sigma^2)}\,dx \tag{3.4} \]


None of the standard integration techniques can be used here, and there is no 
closed-form expression for the integral. Table 3.1 at the end of this section provides 
the code for performing such normal distribution calculations in both Matlab and 
R. For the purpose of hand calculation of normal distribution probabilities, we now 
introduce a special normal distribution. 


DEFINITION
The normal distribution with parameter values μ = 0 and σ = 1 is called
the standard normal distribution. A random variable that has a standard
normal distribution is called a standard normal random variable and will
be denoted by Z. The pdf of Z is

\[ f(z; 0, 1) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \qquad -\infty < z < \infty \]

The cdf of Z is $P(Z \le z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\,dy$, which we will denote by
Φ(z).


The standard normal distribution does not frequently serve as a model for a
naturally arising population, since few variables have mean 0 and standard
deviation 1. Instead, it is a reference distribution from which information about other
normal distributions can be obtained. Appendix Table A.3 gives values of Φ(z) for
z = −3.49, −3.48, ..., 3.48, 3.49 and is referred to as the standard normal table or
z table. Figure 3.14 illustrates the type of cumulative area (probability) tabulated
in Table A.3. From this table, various other probabilities involving Z can be
calculated.


Fig. 3.14 Standard normal cumulative areas tabulated in Appendix Table A.3
(shaded area = Φ(z), the area under the standard normal (z) curve to the left of z)


Example 3.19 Here we demonstrate how the z table is used to calculate various
probabilities involving a standard normal rv.

(a) P(Z ≤ 1.25) = Φ(1.25), a probability that is tabulated in Table A.3 at the
intersection of the row marked 1.2 and the column marked .05. The number
there is .8944, so P(Z ≤ 1.25) = .8944. See Fig. 3.15a. In Matlab, we may type
cdf('norm',1.25,0,1); in R, use pnorm(1.25,0,1) or just
pnorm(1.25).

(b) P(Z > 1.25) = 1 − P(Z ≤ 1.25) = 1 − Φ(1.25), the area under the standard
normal curve to the right of 1.25 (an upper-tail area). Since Φ(1.25) = .8944, it
follows that P(Z > 1.25) = .1056. Since Z is a continuous rv, P(Z ≥ 1.25) also
equals .1056. See Fig. 3.15b.

(c) P(Z ≤ −1.25) = Φ(−1.25), a lower-tail area. Directly from the z table,
Φ(−1.25) = .1056. By symmetry of the normal curve, this is identical to the
probability in (b).

(d) P(−.38 ≤ Z ≤ 1.25) is the area under the standard normal curve above the
interval whose left endpoint is −.38 and whose right endpoint is 1.25. From
Sect. 3.1, if Z is a continuous rv with cdf F(z), then P(a ≤ Z ≤ b) = F(b) − F(a).
This gives P(−.38 ≤ Z ≤ 1.25) = Φ(1.25) − Φ(−.38) = .8944 − .3520 = .5424.
(See Fig. 3.16.) To evaluate this probability in Matlab, type
cdf('norm',1.25,0,1)-cdf('norm',-.38,0,1); in R, type
pnorm(1.25,0,1)-pnorm(-.38,0,1) or just pnorm(1.25)-pnorm(-.38).

Fig. 3.15 Normal curve areas (probabilities) for Example 3.19: (a) shaded area = Φ(1.25); (b) shaded area = P(Z > 1.25)

Fig. 3.16 P(−.38 ≤ Z ≤ 1.25) as the difference between two cumulative areas
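
For readers working in Python rather than Matlab or R, the four probabilities of Example 3.19 can be reproduced with the standard library's statistics.NormalDist class, whose cdf method plays the role of Φ. This is offered only as an equivalent sketch; the variable names are our own.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)       # standard normal rv

p_a = Z.cdf(1.25)                   # (a) P(Z <= 1.25) = Phi(1.25)
p_b = 1 - Z.cdf(1.25)               # (b) P(Z > 1.25), an upper-tail area
p_c = Z.cdf(-1.25)                  # (c) P(Z <= -1.25), a lower-tail area
p_d = Z.cdf(1.25) - Z.cdf(-0.38)    # (d) P(-.38 <= Z <= 1.25)

for label, p in zip("abcd", (p_a, p_b, p_c, p_d)):
    print(label, round(p, 4))
```

Rounded to four decimal places, these values should agree with the z table entries quoted in the example.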


From Sect. 3.1, we have that the (100p)th percentile of the standard normal
distribution, for any p between 0 and 1, is the solution to the equation Φ(z) = p.
So, we may write the (100p)th percentile of the standard normal distribution as
$\eta_p = \Phi^{-1}(p)$. Matlab, R, or the z table can be used to obtain this percentile.


Example 3.20 The 99th percentile of the standard normal distribution, $\Phi^{-1}(.99)$, is
the value on the horizontal axis such that the area under the curve to the left of the
value is .9900, as illustrated in Fig. 3.17. To solve the “inverse” problem Φ(z) = p, the
standard normal table is used in an inverse fashion: Find in the middle of the table
.9900; the row and column in which it lies identify the 99th z percentile. Here .9901
lies in the row marked 2.3 and column marked .03, so Φ(2.33) = .9901 ≈ .99 and the
99th percentile is approximately z = 2.33. By symmetry, the first percentile is the
negative of the 99th percentile, so it equals −2.33 (1% lies below the first and above
the 99th). See Fig. 3.18.

Fig. 3.17 Finding the 99th percentile (shaded area = .9900 under the z curve)

Fig. 3.18 The relationship between the 1st and 99th percentiles (shaded area = .01
in each tail; −2.33 = 1st percentile, 2.33 = 99th percentile)

To find the 99th percentile in Matlab, use the command
icdf('norm',.99,0,1); “icdf” stands for “inverse cumulative distribution
function,” meaning $\Phi^{-1}$. In R, qnorm(.99,0,1) or just qnorm(.99) produces
that same value of roughly z = 2.33.


3.3.2 Non-standardized Normal Distributions 


When X ~ N(μ, σ), probabilities involving X may be computed by “standardizing.”
A standardized variable has the form (X − μ)/σ. Subtracting μ shifts the mean
from μ to zero, and then dividing by σ scales the variable so that the standard
deviation is 1 rather than σ.


PROPOSITION

If X ~ N(μ, σ), then the “standardized” rv Z defined by

\[ Z = \frac{X - \mu}{\sigma} \]

has a standard normal distribution. Thus

\[ P(a \le X \le b) = P\left(\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right) = \Phi\left(\frac{b - \mu}{\sigma}\right) - \Phi\left(\frac{a - \mu}{\sigma}\right) \]

and the (100p)th percentile of the N(μ, σ) distribution is given by

\[ \eta_p = \mu + \Phi^{-1}(p) \cdot \sigma. \]

Conversely, if Z ~ N(0, 1) and μ and σ are constants (with σ > 0), then the
“un-standardized” rv X = μ + σZ has a normal distribution with mean μ and
standard deviation σ.


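Proof Let X ~ N(μ, σ) and define Z = (X − μ)/σ as in the statement of the
proposition. Then the cdf of Z is given by

\[ F_Z(z) = P(Z \le z) = P\left(\frac{X - \mu}{\sigma} \le z\right) = P(X \le \mu + z\sigma) = \int_{-\infty}^{\mu + z\sigma} f(x; \mu, \sigma)\,dx = \int_{-\infty}^{\mu + z\sigma} \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2/(2\sigma^2)}\,dx \]

Now make the substitution u = (x − μ)/σ. The new limits of integration become
−∞ to z, and the differential dx is replaced by σ du, resulting in

\[ F_Z(z) = \int_{-\infty}^{z} \frac{1}{\sigma\sqrt{2\pi}} e^{-u^2/2}\,\sigma\,du = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du = \Phi(z) \]

Thus, the cdf of (X − μ)/σ is the standard normal cdf, which establishes that
(X − μ)/σ ~ N(0, 1).
The probability formulas in the statement of the proposition follow directly from
this main result, as does the formula for the (100p)th percentile:

\[ p = P(X \le \eta_p) = P\left(\frac{X - \mu}{\sigma} \le \frac{\eta_p - \mu}{\sigma}\right) = \Phi\left(\frac{\eta_p - \mu}{\sigma}\right) \;\Rightarrow\; \frac{\eta_p - \mu}{\sigma} = \Phi^{-1}(p) \;\Rightarrow\; \eta_p = \mu + \Phi^{-1}(p) \cdot \sigma \]

The converse statement Z ~ N(0, 1) ⇒ μ + σZ ~ N(μ, σ) is derived similarly.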


The key idea of this proposition is that by standardizing, any probability 
involving X can be expressed as a probability involving a standard normal rv Z, 
so that the z table can be used. This is illustrated in Fig. 3.19.

Fig. 3.19 Equality of nonstandard and standard normal curve areas


Software eliminates the need for standardizing X, although the standard normal 
distribution is still important in its own right. Table 3.1 at the end of this section 
details the relevant R and Matlab commands, which are also illustrated in the 
following examples. 


Example 3.21 The time that it takes a driver to react to the brake lights
on a decelerating vehicle is critical in avoiding rear-end collisions. The article
“Fast-Rise Brake Lamp as a Collision-Prevention Device” (Ergonomics, 1993:
391-395) suggests that reaction time for an in-traffic response to a brake signal
from standard brake lights can be modeled with a normal distribution having
mean value 1.25 s and standard deviation of .46 s. What is the probability that
reaction time is between 1.00 s and 1.75 s? If we let X denote reaction time,
then standardizing gives 1.00 ≤ X ≤ 1.75 if and only if

\[ \frac{1.00 - 1.25}{.46} \le \frac{X - 1.25}{.46} \le \frac{1.75 - 1.25}{.46} \]

The middle expression, by the previous proposition, is a standard normal
rv. Thus

\[ P(1.00 \le X \le 1.75) = P\left(\frac{1.00 - 1.25}{.46} \le Z \le \frac{1.75 - 1.25}{.46}\right) = P(-.54 \le Z \le 1.09) = \Phi(1.09) - \Phi(-.54) = .8621 - .2946 = .5675 \]

This is illustrated in Fig. 3.20. The same answer may be produced in Matlab
with the command cdf('norm',1.75,1.25,.46)-cdf('norm',1.00,1.25,.46);
Matlab gives the answer .5681, which is more accurate than the value
.5675 above (due to rounding the z-values to two decimal places). The analogous R
command is pnorm(1.75,1.25,.46)-pnorm(1.00,1.25,.46).

Fig. 3.20 Normal curves for Example 3.21 (the area P(1.00 ≤ X ≤ 1.75) under the
normal curve with μ = 1.25, σ = .46, and the equal area under the z curve)

Similarly, if we view 2 s as a critically long reaction time, the probability that
actual reaction time will exceed this value is

\[ P(X > 2) = P\left(Z > \frac{2 - 1.25}{.46}\right) = P(Z > 1.63) = 1 - \Phi(1.63) = .0516 \]

This probability is determined in Matlab and R with the commands
1-cdf('norm',2,1.25,.46) and 1-pnorm(2,1.25,.46), respectively.
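
The small discrepancy between .5675 and .5681 in Example 3.21 is easy to reproduce. The sketch below, written in Python with the standard library's statistics.NormalDist as a stand-in for the Matlab and R commands, computes the probability three ways: directly from N(1.25, .46), by standardizing with full precision, and by standardizing with z rounded to two decimals as the z table requires. All variable names are our own.

```python
from statistics import NormalDist

mu, sigma = 1.25, 0.46
X = NormalDist(mu, sigma)     # reaction time model from Example 3.21
Z = NormalDist()              # standard normal

# Direct computation, analogous to pnorm(1.75,1.25,.46)-pnorm(1.00,1.25,.46)
direct = X.cdf(1.75) - X.cdf(1.00)

# Standardize first, then use the standard normal cdf
z_low = (1.00 - mu) / sigma
z_high = (1.75 - mu) / sigma
standardized = Z.cdf(z_high) - Z.cdf(z_low)

# Mimic table lookup: z-values rounded to two decimals (-.54 and 1.09)
table_style = Z.cdf(round(z_high, 2)) - Z.cdf(round(z_low, 2))

# Upper-tail probability P(X > 2), computed without rounding z
upper_tail = 1 - X.cdf(2.0)

print(round(direct, 4), round(standardized, 4), round(table_style, 4), round(upper_tail, 4))
```

The first two numbers coincide, which is the content of the standardizing proposition; only the table-style value differs, and only because of the rounding of z.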


Example 3.22 The amount of distilled water dispensed by a machine is normally
distributed with mean value 64 oz and standard deviation .78 oz. What container
size c will ensure that overflow occurs only .5% of the time? If X denotes the
amount dispensed, the desired condition is that P(X > c) = .005, or, equivalently,
that P(X ≤ c) = .995. Thus c is the 99.5th percentile of the normal distribution with
μ = 64 and σ = .78. The 99.5th percentile of the standard normal distribution is
$\Phi^{-1}(.995) = 2.58$, so

\[ c = \eta_{.995} = 64 + (2.58)(.78) = 64 + 2.0 = 66.0 \text{ oz} \]

This is illustrated in Fig. 3.21.

Fig. 3.21 Distribution of amount dispensed for Example 3.22 (shaded area = .995;
μ = 64; c = 99.5th percentile = 66.0)

The relevant Matlab and R commands are icdf('norm',.995,64,.78)
and qnorm(.995,64,.78), respectively.
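
The percentile formula can be checked against a direct inverse-cdf computation. In the following Python sketch, the inv_cdf method of statistics.NormalDist stands in for Matlab's icdf and R's qnorm; the variable names are our own.

```python
from statistics import NormalDist

mu, sigma, p = 64.0, 0.78, 0.995

z_p = NormalDist().inv_cdf(p)                 # 99.5th percentile of the z curve
c_formula = mu + z_p * sigma                  # percentile via mu + (z percentile) * sigma
c_direct = NormalDist(mu, sigma).inv_cdf(p)   # analogous to qnorm(.995, 64, .78)

print(round(z_p, 4), round(c_formula, 3), round(c_direct, 3))
```

Both routes give a container size of about 66.0 oz, matching the hand calculation based on the rounded value 2.58.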


Standardizing amounts to nothing more than calculating a distance from the
mean and then reexpressing the distance as some number of standard
deviations. For example, if μ = 100 and σ = 15, then x = 130 corresponds to
z = (130 − 100)/15 = 30/15 = 2.00. That is, 130 is 2 standard deviations
above (to the right of) the mean value. Similarly, standardizing 85 gives
(85 − 100)/15 = −1.00, so 85 is 1 standard deviation below the mean.
The z table applies to any normal distribution provided that we think in terms of
number of standard deviations away from the mean value.


Example 3.23 The return on a diversified investment portfolio is normally
distributed. What is the probability that the return is within 1 standard deviation
of its mean value? This question can be answered without knowing either μ or σ, as
long as the distribution is known to be normal; in other words, the answer is the
same for any normal distribution. Going one standard deviation below μ lands us at
μ − σ, while μ + σ is one standard deviation above the mean. Thus

\[ P(X \text{ is within one standard deviation of its mean}) = P(\mu - \sigma \le X \le \mu + \sigma) \]
\[ = P\left(\frac{\mu - \sigma - \mu}{\sigma} \le Z \le \frac{\mu + \sigma - \mu}{\sigma}\right) = P(-1 \le Z \le 1) = \Phi(1) - \Phi(-1) = .6826 \]

The probability that X is within 2 standard deviations of the mean is
P(−2 ≤ Z ≤ 2) = .9544 and the probability that X is within 3 standard deviations
of the mean is P(−3 ≤ Z ≤ 3) = .9973.


The results of Example 3.23 are often reported in percentage form and referred 
to as the empirical rule (because empirical evidence has shown that histograms of 
real data can very frequently be approximated by normal curves). 


EMPIRICAL RULE 

If the population distribution of a variable is (approximately) normal, then 
1. Roughly 68% of the values are within 1 SD of the mean. 

2. Roughly 95% of the values are within 2 SDs of the mean. 

3. Roughly 99.7% of the values are within 3 SDs of the mean. 
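
The three percentages in the empirical rule do not depend on μ or σ, and a short computation makes this concrete. The sketch below uses Python's statistics.NormalDist; the helper name within_k_sd and the sample parameter values are assumptions made only for illustration.

```python
from statistics import NormalDist

def within_k_sd(k, mu=0.0, sigma=1.0):
    """P(mu - k*sigma <= X <= mu + k*sigma) for X ~ N(mu, sigma)."""
    X = NormalDist(mu, sigma)
    return X.cdf(mu + k * sigma) - X.cdf(mu - k * sigma)

for k in (1, 2, 3):
    # same probability for the standard normal and for an arbitrary N(100, 15)
    print(k, round(within_k_sd(k), 4), round(within_k_sd(k, 100, 15), 4))
```

The values computed this way may differ from the table-based .6826 and .9544 in the fourth decimal place, because the z table entries are themselves rounded.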


3.3.3. The Normal MGF 


The moment generating function provides a straightforward way to establish 
several important results concerning the normal distribution. 


PROPOSITION
The moment generating function of a normally distributed random variable
X is

\[ M_X(t) = e^{\mu t + \sigma^2 t^2/2} \]

Proof Consider first the special case of a standard normal rv Z. Then

\[ M_Z(t) = E(e^{tZ}) = \int_{-\infty}^{\infty} e^{tz} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(z^2 - 2tz)/2}\,dz \]

Completing the square in the exponent, we have

\[ M_Z(t) = e^{t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(z^2 - 2tz + t^2)/2}\,dz = e^{t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(z - t)^2/2}\,dz \]

The last integral is the area under a normal density curve with mean t and
standard deviation 1, so the value of the integral is 1. Therefore, $M_Z(t) = e^{t^2/2}$.

Now let X be any normal rv with mean μ and standard deviation σ.
Then, by the proposition earlier in this section, (X − μ)/σ = Z, where Z is
standard normal. Rewrite this relationship as X = μ + σZ, and use the property
$M_{aY+b}(t) = e^{bt}M_Y(at)$:

\[ M_X(t) = M_{\mu + \sigma Z}(t) = e^{\mu t} M_Z(\sigma t) = e^{\mu t} e^{\sigma^2 t^2/2} = e^{\mu t + \sigma^2 t^2/2} \]


The normal mgf can be used to establish that μ and σ are indeed the mean
and standard deviation of X, as claimed earlier (Exercise 68). Also, by the mgf
uniqueness property, any rv X whose moment generating function has the form
specified above is necessarily normally distributed. For example, if it is known that
the mgf of X is M_X(t) = e^{8t²}, then X must be a normal rv with mean μ = 0 and
standard deviation σ = 4 (since the N(0, 4) distribution has e^{8t²} as its mgf).
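
As a numerical sanity check on the normal mgf, the sketch below approximates E(e^{tX}) by a midpoint-rule integral and compares it with the closed form e^{μt + σ²t²/2}. It is written in Python for illustration only; the function names, the integration grid, and the choice t = .25 are our own assumptions, not part of the text.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of the N(mu, sigma) distribution at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def mgf_numeric(t: float, mu: float, sigma: float,
                half_width: float = 12.0, steps: int = 20000) -> float:
    """Approximate E(e^{tX}) with the midpoint rule.

    The integrand e^{tx} f(x) is concentrated near mu + sigma^2 * t, so the
    grid is centered there and extends half_width standard deviations each way.
    """
    center = mu + sigma ** 2 * t
    lo = center - half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += math.exp(t * x) * normal_pdf(x, mu, sigma)
    return total * h


def mgf_closed_form(t: float, mu: float, sigma: float) -> float:
    """The normal mgf from the proposition: exp(mu*t + sigma^2 t^2 / 2)."""
    return math.exp(mu * t + sigma ** 2 * t ** 2 / 2)


# N(0, 4) example: the closed form reduces to e^{8 t^2}.
mu, sigma, t = 0.0, 4.0, 0.25
numeric = mgf_numeric(t, mu, sigma)
closed = mgf_closed_form(t, mu, sigma)
print(numeric, closed, math.exp(8 * t ** 2))
```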

It was established earlier in this section that if X ~ N(μ, σ) and Z = (X − μ)/σ, then
Z ~ N(0, 1), and vice versa. This standardizing transformation is actually a special
case of a much more general property. 


PROPOSITION 
Let X ~ N(μ, σ). Then for any constants a and b with a ≠ 0, aX + b is also
normally distributed. That is, any linear rescaling of a normal rv is normal. 


The proof of this proposition uses mgfs and is left as an exercise (Exercise 70). 
This proposition provides a much easier proof of the earlier relationship between 
X and Z. The rescaling formulas and this proposition combine to give the following 
statement: if X is normally distributed and Y = aX + b (a ≠ 0), then Y is also normal,
with mean μ_Y = aμ_X + b and standard deviation σ_Y = |a|σ_X.
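
The rescaling rules can be checked on a small numerical case. The sketch below (Python, for illustration; the parameter values and the constants a and b are hypothetical) builds the distribution of Y = aX + b from the formulas μ_Y = aμ_X + b and σ_Y = |a|σ_X and confirms that a probability computed from Y agrees with the same event expressed in terms of X.

```python
from statistics import NormalDist

# Hypothetical values chosen only for illustration.
X = NormalDist(mu=100.0, sigma=15.0)
a, b = -0.5, 20.0

# Rescaling rules: mean a*mu + b, standard deviation |a|*sigma.
Y = NormalDist(mu=a * X.mean + b, sigma=abs(a) * X.stdev)

# Cross-check one probability. Because a < 0, the inequality flips:
# P(Y <= y) = P(aX + b <= y) = P(X >= (y - b)/a).
y = -35.0
p_via_Y = Y.cdf(y)
p_via_X = 1.0 - X.cdf((y - b) / a)
print(Y.mean, Y.stdev, p_via_Y, p_via_X)
```

A negative value of a was chosen deliberately: it shows why the standard deviation uses |a| rather than a.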

3.3.4 The Normal Distribution and Discrete Populations


The normal distribution is often used as an approximation to the distribution of 
values in a discrete population. In such situations, extra care must be taken to ensure 
that probabilities are computed in an accurate manner. 


Example 3.24 IQ (as measured by a standard test) is known to be approximately
normally distributed with μ = 100 and σ = 15. What is the probability that a
randomly selected individual has an IQ of at least 125? Letting X = the IQ of a
randomly chosen person, we wish P(X ≥ 125). The temptation here is to standardize
X ≥ 125 immediately as in previous examples. However, the IQ population is
actually discrete, since IQs are integer-valued, so the normal curve is an
approximation to a discrete probability histogram, as pictured in Fig. 3.22.

The rectangles of the histogram are centered at integers, so IQs of at least 
125 correspond to rectangles beginning at 124.5, as shaded in Fig. 3.22. Thus we 
really want the area under the approximating normal curve to the right of 124.5. 
Standardizing this value gives P(Z ≥ 1.63) = .0516. If we had standardized X ≥ 125,
we would have obtained P(Z ≥ 1.67) = .0475. The difference is not great, but the
answer .0516 is more accurate. Similarly, P(X = 125) would be approximated by 
the area between 124.5 and 125.5, since the area under the normal curve above the 
single value 125 is zero. 


Fig. 3.22 A normal approximation to a discrete distribution
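
The continuity correction in Example 3.24 amounts to shifting each boundary by .5 before evaluating the normal cdf. A minimal Python sketch (for illustration; the variable names are our own) follows. Its output differs slightly from .0516 and .0475 because the text rounds z to two decimal places before using the table.

```python
from statistics import NormalDist

iq = NormalDist(mu=100.0, sigma=15.0)

# P(X >= 125) for integer-valued X: use the area to the right of 124.5.
p_corrected = 1.0 - iq.cdf(124.5)

# Standardizing X >= 125 directly ignores the discreteness.
p_uncorrected = 1.0 - iq.cdf(125.0)

# P(X = 125): area under the curve between 124.5 and 125.5.
p_exactly_125 = iq.cdf(125.5) - iq.cdf(124.5)

print(f"{p_corrected:.4f} {p_uncorrected:.4f} {p_exactly_125:.4f}")
```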


The correction for discreteness of the underlying distribution in Example 3.24 is 
often called a continuity correction; it adjusts for the use of a continuous distribu- 
tion in approximating a probability involving a discrete rv. It is useful in the 
following application of the normal distribution to the computation of binomial 
probabilities. The normal distribution was actually created as an approximation to 
the binomial distribution (by Abraham de Moivre in the 1730s). 
Fig. 3.23 Binomial probability histograms with normal approximation curves superimposed:
(a) n = 20 and p = .6, normal curve with μ = 12, σ = 2.19 (a good fit);
(b) n = 20 and p = .1, normal curve with μ = 2, σ = 1.34 (a poor fit)


3.3.5 Approximating the Binomial Distribution 


Recall that the mean value and standard deviation of a binomial random variable X
are μ = np and σ = √(npq), respectively. Figure 3.23a displays a probability histogram
for the binomial distribution with n = 20, p = .6 [so μ = 20(.6) = 12 and
σ = √(20(.6)(.4)) = 2.19]. A normal curve with mean value and standard deviation
equal to the corresponding values for the binomial distribution has been superimposed
on the probability histogram. Although the probability histogram is a bit
skewed (because p ≠ .5), the normal curve gives a very good approximation,
especially in the middle part of the picture. The area of any rectangle (probability
of any particular X value) except those in the extreme tails can be accurately
approximated by the corresponding normal curve area. For example,

$$P(X = 10) = \binom{20}{10}(.6)^{10}(.4)^{10} = .117,$$

whereas the area under the normal curve between 9.5 and 10.5 is
P(−1.14 ≤ Z ≤ −.68) = .120.

On the other hand, a normal distribution is a poor approximation to a discrete 
distribution that is heavily skewed. For example, Figure 3.23b shows a probability 
histogram for the Bin(20, .1) distribution and the normal pdf with the same 
mean and standard deviation (μ = 2 and σ = 1.34). Clearly, we would not want to
use this normal curve to approximate binomial probabilities, even with a continuity 
correction. 


PROPOSITION 

Let X be a binomial rv based on n trials with success probability p. Then
if the binomial probability histogram is not too skewed, X has approximately
a normal distribution with μ = np and σ = √(npq). In particular, for x = a
possible value of X,

$$
P(X \le x) = B(x; n, p) \approx \left(\text{area under the normal curve to the left of } x + .5\right)
= \Phi\left(\frac{x + .5 - np}{\sqrt{npq}}\right)
$$

In practice, the approximation is adequate provided that both np ≥ 10 and
nq ≥ 10.


If either np < 10 or nq < 10, the binomial distribution may be too skewed for the
(symmetric) normal curve to give accurate approximations.


Example 3.25 Suppose that 25% of all licensed drivers in a state do not have
insurance. Let X be the number of uninsured drivers in a random sample of size
50 (somewhat perversely, a success is an uninsured driver), so that p = .25. Then
μ = 12.5 and σ = 3.062. Since np = 50(.25) = 12.5 ≥ 10 and nq = 37.5 ≥ 10, the
approximation can safely be applied:

$$
P(X \le 10) = B(10; 50, .25) \approx \Phi\left(\frac{10 + .5 - 12.5}{3.062}\right) = \Phi(-.6532) = .2568
$$

Similarly, the probability that between 5 and 15 (inclusive) of the selected
drivers are uninsured is

$$
P(5 \le X \le 15) = B(15; 50, .25) - B(4; 50, .25)
\approx \Phi\left(\frac{15.5 - 12.5}{3.062}\right) - \Phi\left(\frac{4.5 - 12.5}{3.062}\right) = .8320
$$

The exact probabilities are .2622 and .8348, respectively, so the approximations
are quite good. In the last calculation, the probability P(5 ≤ X ≤ 15) is being
approximated by the area under the normal curve between 4.5 and 15.5; the
continuity correction is used for both the upper and lower limits.
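
The comparison between exact and approximate probabilities in Example 3.25 can be reproduced with a few lines of code. The sketch below is in Python (for illustration only); the helper names `binom_cdf` and `normal_approx_cdf` are our own, and the exact cdf is computed by direct summation of the binomial pmf.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_cdf(x: int, n: int, p: float) -> float:
    """Exact B(x; n, p) = P(X <= x) for X ~ Bin(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(0, x + 1))


def normal_approx_cdf(x: int, n: int, p: float) -> float:
    """Continuity-corrected approximation Phi((x + .5 - np) / sqrt(npq))."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return NormalDist().cdf((x + 0.5 - mu) / sigma)


n, p = 50, 0.25

# P(X <= 10)
exact_le_10 = binom_cdf(10, n, p)
approx_le_10 = normal_approx_cdf(10, n, p)

# P(5 <= X <= 15) = B(15) - B(4); the correction applies to both limits.
exact_5_to_15 = binom_cdf(15, n, p) - binom_cdf(4, n, p)
approx_5_to_15 = normal_approx_cdf(15, n, p) - normal_approx_cdf(4, n, p)

print(f"P(X <= 10):      exact {exact_le_10:.4f}, approx {approx_le_10:.4f}")
print(f"P(5 <= X <= 15): exact {exact_5_to_15:.4f}, approx {approx_5_to_15:.4f}")
```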


The wide availability of software for doing binomial probability calculations, 
even for large values of n, has considerably diminished the importance of the 
normal approximation. However, it is important for another reason. When the 
objective of an investigation is to make an inference about a population proportion
p, interest will focus on the sample proportion of successes P̂ = X/n rather than
on X itself. Because this proportion is just X multiplied by the constant 1/n, the
earlier rescaling proposition tells us it will also have approximately a normal
distribution (with mean μ = p and standard deviation σ = √(pq/n)) provided that
both np ≥ 10 and nq ≥ 10. This normal approximation is the basis for several
inferential procedures to be discussed in Chap. 5. 

It is quite difficult to give a direct proof of the validity of this normal approxima- 
tion (the first one goes back about 270 years to de Moivre). In Chap. 4, we’ll see that it 
is a consequence of an important general result called the Central Limit Theorem. 

3.3.6 Normal Distribution Calculations with Software


Many software packages, including Matlab and R, have built-in functions to 
determine both probabilities under a normal curve and quantiles (aka percentiles) 
of any given normal distribution. Table 3.1 summarizes the relevant code in both 
packages. 


Table 3.1 Normal probability and quantile calculations in Matlab and R

Function:   cdf                    quantile, i.e., the (100p)th percentile
Notation:   Φ((x − μ)/σ)           η_p = μ + Φ⁻¹(p)·σ
Matlab:     cdf('norm',x,μ,σ)      icdf('norm',p,μ,σ)
R:          pnorm(x,μ,σ)           qnorm(p,μ,σ)


In the special case of a standard normal distribution, R (but not Matlab) will
allow the user to drop the last two arguments, μ and σ. That is, the R commands
pnorm(x) and pnorm(x,0,1) yield the same result for any number x, and a
similar comment applies to qnorm. Both software packages also have built-in
function calls for the normal pdf: pdf('norm',x,μ,σ) and dnorm(x,μ,σ),
respectively. However, these two commands are generally only used when one
desires to graph a normal density curve (x vs. f(x; μ, σ)), since the pdf evaluated at
a particular x does not represent a probability, as discussed in Sect. 3.1.
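
Readers working outside Matlab and R can carry out the same three calculations elsewhere. As one illustration (not covered by Table 3.1), Python's standard library provides `statistics.NormalDist` with `cdf`, `inv_cdf`, and `pdf` methods; the parameter values below are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0          # hypothetical parameter values
dist = NormalDist(mu, sigma)

cdf_value = dist.cdf(115.0)      # counterpart of pnorm(115, mu, sigma)
quantile = dist.inv_cdf(0.90)    # counterpart of qnorm(0.90, mu, sigma)
density = dist.pdf(100.0)        # counterpart of dnorm(100, mu, sigma)

# NormalDist() with no arguments is the standard normal distribution,
# so the 90th percentile can also be obtained as mu + z90 * sigma.
z90 = NormalDist().inv_cdf(0.90)

print(cdf_value, quantile, mu + z90 * sigma, density)
```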


3.3.7 Exercises: Section 3.3 (39-70)


39. Let Z be a standard normal random variable and obtain each of the following 
probabilities, drawing pictures wherever appropriate. 
(a) P(0 ≤ Z ≤ 2.17)
(b) P(0 ≤ Z ≤ 1)
(c) P(−2.50 ≤ Z ≤ 0)
(d) P(−2.50 ≤ Z ≤ 2.50)
(e) P(Z ≤ 1.37)
(f) P(−1.75 ≤ Z)
(g) P(−1.50 ≤ Z ≤ 2.00)
(h) P(1.37 ≤ Z ≤ 2.50)
(i) P(1.50 ≤ Z)
(j) P(|Z| ≤ 2.50)
40. In each case, determine the value of the constant c that makes the probability
statement correct.
(a) Φ(c) = .9838
(b) P(0 ≤ Z ≤ c) = .291
(c) P(c ≤ Z) = .121
(d) P(−c ≤ Z ≤ c) = .668
(e) P(c ≤ |Z|) = .016
41. Find the following percentiles for the standard normal distribution. Interpolate
where appropriate.
(a) 91st
(b) 9th
(c) 75th
(d) 25th
(e) 6th

42. Suppose the force acting on a column that helps to support a building is a
normally distributed random variable X with mean value 15.0 kips and
standard deviation 1.25 kips. Compute the following probabilities.
(a) P(X ≤ 15)
(b) P(X ≤ 17.5)
(c) P(X ≥ 10)
(d) P(14 ≤ X ≤ 18)
(e) P(|X − 15| ≤ 3)

43. Mopeds (small motorcycles with an engine capacity below 50 cc) are
very popular in Europe because of their mobility, ease of operation, and
low cost. The article “Procedure to Verify the Maximum Speed of Automatic
Transmission Mopeds in Periodic Motor Vehicle Inspections” (J. of
Automobile Engr., 2008: 1615-1623) described a rolling bench test for
determining maximum vehicle speed. A normal distribution with mean value
46.8 km/h and standard deviation 1.75 km/h is postulated. Consider randomly
selecting a single such moped.
(a) What is the probability that maximum speed is at most 50 km/h?
(b) What is the probability that maximum speed is at least 48 km/h?
(c) What is the probability that maximum speed differs from the mean value
by at most 1.5 standard deviations?

44. Let X be the birth weight, in grams, of a randomly selected full-term baby. The
article “Fetal Growth Parameters and Birth Weight: Their Relationship to
Neonatal Body Composition” (Ultrasound in Obstetrics and Gynecology,
2009: 441-446) suggests that X is normally distributed with mean 3500 and
standard deviation 600.
(a) Sketch the relevant density curve, including tick marks on the horizontal
scale.
(b) What is P(3000 < X < 4500), and how does this compare to
P(3000 ≤ X ≤ 4500)?
(c) What is the probability that the weight of such a newborn is less than
2500 g?
(d) What is the probability that the weight of such a newborn exceeds 6000 g
(roughly 13.2 lb)?
(e) How would you characterize the most extreme .1% of all birth weights?
(f) Use the rescaling proposition from this section to determine the
distribution of birth weight expressed in pounds (shape, mean, and standard
deviation), and then recalculate the probability from part (c). How does
this compare to your previous answer?

45. Based on extensive data from an urban freeway near Toronto, Canada, “it is
assumed that free speeds can best be represented by a normal distribution”
(“Impact of Driver Compliance on the Safety and Operational Impacts of
Freeway Variable Speed Limit Systems,” J. of Transp. Engr., 2011: 260-268).
The mean and standard deviation reported in the article were 119 km/h
and 13.1 km/h, respectively.
(a) What is the probability that the speed of a randomly selected vehicle is
between 100 and 120 km/h?
(b) What speed characterizes the fastest 10% of all speeds?
(c) The posted speed limit was 100 km/h. What percentage of vehicles was
traveling at speeds exceeding this posted limit?
(d) If five vehicles are randomly and independently selected, what is the
probability that at least one is not exceeding the posted speed limit?
(e) What is the probability that the speed of a randomly selected vehicle
exceeds 70 miles/h?

46. The defect length of a corrosion defect in a pressurized steel pipe is normally
distributed with mean value 30 mm and standard deviation 7.8 mm (suggested
in the article “Reliability Evaluation of Corroding Pipelines Considering
Multiple Failure Modes and Time-Dependent Internal Pressure,” J. of
Infrastructure Systems, 2011: 216-224).
(a) What is the probability that defect length is at most 20 mm? Less than
20 mm?
(b) What is the 75th percentile of the defect length distribution, i.e., the value
that separates the smallest 75% of all lengths from the largest 25%?
(c) What is the 15th percentile of the defect length distribution?
(d) What values separate the middle 80% of the defect length distribution
from the smallest 10% and the largest 10%?

47. The plasma cholesterol level (mg/dL) for patients with no prior evidence of
heart disease who experience chest pain is normally distributed with mean
200 and standard deviation 35. Consider randomly selecting an individual of
this type. What is the probability that the plasma cholesterol level
(a) Is at most 250?
(b) Is between 300 and 400?
(c) Differs from the mean by at least 1.5 standard deviations?

48. Suppose the diameter at breast height (in.) of trees of a certain type is
normally distributed with μ = 8.8 and σ = 2.8, as suggested in the article
“Simulating a Harvester-Forwarder Softwood Thinning” (Forest Products
J., May 1997: 36-41).
(a) What is the probability that the diameter of a randomly selected tree will
be at least 10 in.? Will exceed 10 in.?
(b) What is the probability that the diameter of a randomly selected tree will
exceed 20 in.?
(c) What is the probability that the diameter of a randomly selected tree will
be between 5 and 10 in.?
(d) What value c is such that the interval (8.8 − c, 8.8 + c) includes 98% of
all diameter values?
(e) If four trees are independently selected, what is the probability that at least
one has a diameter exceeding 10 in.?

49. There are two machines available for cutting corks intended for use in wine
bottles. The first produces corks with diameters that are normally distributed
with mean 3 cm and standard deviation .1 cm. The second machine produces
corks with diameters that have a normal distribution with mean 3.04 cm and
standard deviation .02 cm. Acceptable corks have diameters between 2.9 and
3.1 cm. Which machine is more likely to produce an acceptable cork?

50. Human body temperatures for healthy individuals have approximately a
normal distribution with mean 98.25 °F and standard deviation .75 °F. (The
past accepted value of 98.6 °F was obtained by converting the Celsius value of
37°, which is correct to the nearest integer.)
(a) Find the 90th percentile of the distribution.
(b) Find the 5th percentile of the distribution.
(c) What temperature separates the coolest 25% from the others?

51. The article “Monte Carlo Simulation—Tool for Better Understanding of
LRFD” (J. Struct. Engr., 1993: 1586-1599) suggests that yield strength
(ksi) for A36 grade steel is normally distributed with μ = 43 and σ = 4.5.
(a) What is the probability that yield strength is at most 40? Greater than 60?
(b) What yield strength value separates the strongest 75% from the others?

52. The automatic opening device of a military cargo parachute has been designed
to open when the parachute is 200 m above the ground. Suppose opening
altitude actually has a normal distribution with mean value 200 m and standard
deviation 30 m. Equipment damage will occur if the parachute opens at an
altitude of less than 100 m. What is the probability that there is equipment
damage to the payload of at least one of five independently dropped parachutes?

53. The temperature reading from a thermocouple placed in a constant-temperature
medium is normally distributed with mean μ, the actual temperature
of the medium, and standard deviation σ. What would the value of σ have
to be to ensure that 95% of all readings are within .1° of μ?

54. Vehicle speed on a particular bridge in China can be modeled as normally
distributed (“Fatigue Reliability Assessment for Long-Span Bridges under
Combined Dynamic Loads from Winds and Vehicles,” J. of Bridge Engr.,
2013: 735-747).
(a) If 5% of all vehicles travel less than 39.12 mph and 10% travel more than
73.24 mph, what are the mean and standard deviation of vehicle speed?
[Note: The resulting values should agree with those given in the cited
article.]
(b) What is the probability that a randomly selected vehicle’s speed is
between 50 and 65 mph?
(c) What is the probability that a randomly selected vehicle’s speed exceeds
the speed limit of 70 mph?

55. If adult female heights are normally distributed, what is the probability that
the height of a randomly selected woman is
(a) Within 1.5 SDs of its mean value?
(b) Farther than 2.5 SDs from its mean value?
(c) Between 1 and 2 SDs from its mean value?

56. A machine that produces ball bearings has initially been set so that the true
average diameter of the bearings it produces is .500 in. A bearing is acceptable
if its diameter is within .004 in. of this target value. Suppose, however, that the
setting has changed during the course of production, so that the bearings have
normally distributed diameters with mean value .499 in. and standard
deviation .002 in. What percentage of the bearings produced will not be acceptable?

57. The Rockwell hardness of a metal is determined by impressing a hardened
point into the surface of the metal and then measuring the depth of penetration
of the point. Suppose the Rockwell hardness of an alloy is normally
distributed with mean 70 and standard deviation 3. (Rockwell hardness is
measured on a continuous scale.)
(a) If a specimen is acceptable only if its hardness is between 67 and 75, what
is the probability that a randomly chosen specimen has an acceptable
hardness?
(b) If the acceptable range of hardness is (70 − c, 70 + c), for what value of
c would 95% of all specimens have acceptable hardness?
(c) If the acceptable range is as in part (a) and the hardness of each of ten
randomly selected specimens is independently determined, what is the
expected number of acceptable specimens among the ten?
(d) What is the probability that at most eight of ten independently selected
specimens have a hardness of less than 73.84? [Hint: Y = the number
among the ten specimens with hardness less than 73.84 is a binomial
variable; what is p?]

58. The weight distribution of parcels sent in a certain manner is normal with mean
value 12 lb and standard deviation 3.5 lb. The parcel service wishes to establish
a weight value c beyond which there will be a surcharge. What value of c is
such that 99% of all parcels are at least 1 lb under the surcharge weight?

59. Suppose Appendix Table A.3 contained Φ(z) only for z ≥ 0. Explain how you
could still compute
(a) P(−1.72 ≤ Z ≤ −.55)
(b) P(−1.72 ≤ Z ≤ .55)
Is it necessary to table Φ(z) for z negative? What property of the standard
normal curve justifies your answer?

60. Chebyshev’s inequality (Sect. 3.2) states that for any number k satisfying
k ≥ 1, P(|X − μ| ≥ kσ) is no more than 1/k². Obtain this probability in the case
of a normal distribution for k = 1, 2, and 3, and compare to Chebyshev’s upper
bound.

61. Let X denote the number of flaws along a 100-m reel of magnetic tape
(an integer-valued variable). Suppose X has approximately a normal
distribution with μ = 25 and σ = 5. Use the continuity correction to calculate the
probability that the number of flaws is
(a) Between 20 and 30, inclusive.
(b) At most 30. Less than 30.

62. Let X have a binomial distribution with parameters n = 25 and p. Calculate
each of the following probabilities using the normal approximation (with the
continuity correction) for the cases p = .5, .6, and .8 and compare to the exact
probabilities calculated from Appendix Table A.1.
(a) P(15 ≤ X ≤ 20)
(b) P(X ≤ 15)
(c) P(20 ≤ X)

63. Suppose that 10% of all steel shafts produced by a process are nonconforming
but can be reworked (rather than having to be scrapped). Consider a random
sample of 200 shafts, and let X denote the number among these that are
nonconforming and can be reworked. What is the (approximate) probability
that X is
(a) At most 30?
(b) Less than 30?
(c) Between 15 and 25 (inclusive)?

64. Suppose only 70% of all drivers in a state regularly wear a seat belt. A random
sample of 500 drivers is selected. What is the probability that
(a) Between 320 and 370 (inclusive) of the drivers in the sample regularly
wear a seat belt?
(b) Fewer than 325 of those in the sample regularly wear a seat belt? Fewer
than 315?

65. In response to concerns about nutritional contents of fast foods, McDonald’s
announced that it would use a new cooking oil for its french fries that would
decrease substantially trans fatty acid levels and increase the amount of more
beneficial polyunsaturated fat. The company claimed that 97 out of 100 people
cannot detect a difference in taste between the new and old oils. Assuming
that this figure is correct (as a long-run proportion), what is the approximate
probability that in a random sample of 1000 individuals who have purchased
fries at McDonald’s,
(a) At least 40 can taste the difference between the two oils?
(b) At most 5% can taste the difference between the two oils?

66. The following proof that the normal pdf integrates to 1 comes courtesy of
Professor Robert Young, Oberlin College. Let f(z) denote the standard normal
pdf, and consider the function of two variables

$$
g(x, y) = f(x)\cdot f(y) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,\frac{1}{\sqrt{2\pi}}e^{-y^2/2} = \frac{1}{2\pi}e^{-(x^2 + y^2)/2}
$$

Let V denote the volume under g(x, y) above the xy-plane.
(a) Let A denote the area under the standard normal curve. By setting up the
double integral for the volume underneath g(x, y), show that V = A².
(b) Using the rotational symmetry of g(x, y), V can be determined by adding
up the volumes of shells from rotation about the y-axis:

$$V = \int_0^{\infty} 2\pi r \cdot \frac{1}{2\pi}e^{-r^2/2}\,dr$$

Show this integral equals 1, then use (a) to establish that the area under the
standard normal curve is 1.
(c) Show that $\int_{-\infty}^{\infty} f(x; \mu, \sigma)\,dx = 1$. [Hint: Write out the integral, and then
make a substitution to reduce it to the standard normal case. Then invoke
(b).]

67. Suppose X ~ N(μ, σ).

(a) Show via integration that E(X) = μ. [Hint: Make the substitution
u = (x − μ)/σ, which will create two integrals. For one, use the symmetry
of the pdf; for the other, use the fact that the standard normal pdf
integrates to 1.]

(b) Show via integration that Var(X) = σ². [Hint: Evaluate the integral for
E[(X − μ)²] rather than using the variance shortcut formula. Use the same
substitution as in part (a).]

68. The moment generating function can be used to find the mean and variance of
the normal distribution.

(a) Use derivatives of M_X(t) to verify that E(X) = μ and Var(X) = σ².

(b) Repeat (a) using L_X(t) = ln[M_X(t)], and compare with part (a) in terms of
effort. (Refer back to Exercise 36 for properties of the function L_X(t).)

69. There is no nice formula for the standard normal cdf ®(z), but several good 
approximations have been published in articles. The following is from 

“Approximations for Hand Calculators Using Small Integer Coefficients” 

(Math. Comput., 1977: 214-222). For 0 < z ≤ 5.5,

$$
P(Z \ge z) = 1 - \Phi(z) \approx .5\exp\left\{-\left[\frac{(83z + 351)z + 562}{(703/z) + 165}\right]\right\}
$$

The relative error of this approximation is less than .042%. Use this to
calculate approximations to the following probabilities, and compare whenever
possible to the probabilities obtained from Appendix Table A.3.

(a) P(Z ≥ 1)
(b) P(Z < −3)
(c) P(−4 < Z < 4)
(d) P(Z > 5)

70. (a) Use mgfs to show that if X has a normal distribution with parameters
μ and σ, then Y = aX + b (a linear function of X) also has a normal
distribution. What are the parameters of the distribution of Y [i.e., E(Y) 
and SD(Y)]? 

(b) If when measured in °C, temperature is normally distributed with mean 
115 and standard deviation 2, what can be said about the distribution of 
temperature measured in °F? 


3.4 The Exponential and Gamma Distributions 


The graph of any normal pdf is bell-shaped and thus symmetric. In many practical 
situations, the variable of interest to the experimenter might have a skewed
distribution. A family of pdfs that yields a wide variety of skewed distributional shapes is
the gamma family. We first consider a special case, the exponential distribution,
and then generalize later in the section.


3.4.1 The Exponential Distribution 


The family of exponential distributions provides probability models that are widely 
used in engineering and science disciplines. 


DEFINITION 
X is said to have an exponential distribution with parameter λ (λ > 0) if the
pdf of X is

$$
f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}
$$


Some sources write the exponential pdf in the form (1/β)e^{−x/β}, so that β = 1/λ.
Graphs of several exponential pdfs appear in Fig. 3.24.
The expected value of an exponentially distributed random variable X is

$$E(X) = \int_0^{\infty} x \cdot \lambda e^{-\lambda x}\,dx$$

Obtaining this expected value requires integration by parts. The variance of
X can be computed using the shortcut formula Var(X) = E(X²) − [E(X)]²;
evaluating E(X²) uses integration by parts twice in succession. In contrast, the
exponential cdf is easily obtained by integrating the pdf. The results of these
integrations are as follows.

Fig. 3.24 Exponential density curves

PROPOSITION
Let X be an exponential variable with parameter λ. Then the cdf of X is

$$
F(x; \lambda) = \begin{cases} 0 & x < 0 \\ 1 - e^{-\lambda x} & x \ge 0 \end{cases}
$$

The mean and standard deviation of X are both equal to 1/λ.


Under the alternative parameterization, the exponential cdf becomes 1 − e^{−x/β}
for x ≥ 0, and the mean and standard deviation are both equal to β.


Example 3.26 The response time X at an on-line computer terminal (the elapsed time
between the end of a user’s inquiry and the beginning of the system’s response to that
inquiry) has an exponential distribution with expected response time equal to 5 s. Then
E(X) = 1/λ = 5, so λ = .2. The probability that the response time is at most 10 s is

$$P(X \le 10) = F(10; .2) = 1 - e^{-(.2)(10)} = 1 - e^{-2} = 1 - .135 = .865$$

The probability that response time is between 5 and 10 s is

$$P(5 \le X \le 10) = F(10; .2) - F(5; .2) = (1 - e^{-2}) - (1 - e^{-1}) = .233$$
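
The two response-time probabilities follow directly from the exponential cdf, and can be checked with a short computation. The sketch below is in Python (for illustration only); the function name `exp_cdf` is our own.

```python
import math


def exp_cdf(x: float, lam: float) -> float:
    """F(x; lambda) = 1 - e^{-lambda x} for x >= 0, and 0 for x < 0."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0


lam = 1 / 5   # expected response time E(X) = 1/lambda = 5 s, so lambda = .2

p_at_most_10 = exp_cdf(10, lam)
p_between_5_and_10 = exp_cdf(10, lam) - exp_cdf(5, lam)

print(f"P(X <= 10)      = {p_at_most_10:.3f}")
print(f"P(5 <= X <= 10) = {p_between_5_and_10:.3f}")
```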


The exponential distribution is frequently used as a model for the distribution 
of times between the occurrence of successive events, such as customers arriving 
at a service facility or calls coming in to a call center. The reason for this is that 
the exponential distribution is closely related to the Poisson distribution introduced 
in Chap. 2. We will explore this relationship fully in Sect. 7.5 (Poisson Processes). 

Another important application of the exponential distribution is to model 
the distribution of component lifetimes. A partial reason for the popularity of 
such applications is the “memoryless” property of the exponential distribution. 
Suppose component lifetime is exponentially distributed with parameter λ. After
putting the component into service, we leave for a period of t₀ hours and then return
to find the component still working; what now is the probability that it lasts
at least an additional t hours? In symbols, we wish P(X ≥ t + t₀ | X ≥ t₀). By the
definition of conditional probability,

$$
P(X \ge t + t_0 \mid X \ge t_0) = \frac{P[(X \ge t + t_0) \cap (X \ge t_0)]}{P(X \ge t_0)}
$$

But the event X ≥ t₀ in the numerator is redundant, since both events can occur
if and only if X ≥ t + t₀. Therefore,

$$
P(X \ge t + t_0 \mid X \ge t_0) = \frac{P(X \ge t + t_0)}{P(X \ge t_0)}
= \frac{1 - F(t + t_0; \lambda)}{1 - F(t_0; \lambda)} = \frac{e^{-\lambda(t + t_0)}}{e^{-\lambda t_0}} = e^{-\lambda t}
$$

This conditional probability is identical to the original probability P(X ≥ t) that
the component lasted t hours. Thus the distribution of additional lifetime is exactly
the same as the original distribution of lifetime, so at each point in time the
component shows no effect of wear. In other words, the distribution of remaining
lifetime is independent of current age (we wish that were true of us!).
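
The memoryless property is easy to confirm numerically. In the Python sketch below (for illustration; the rate λ = .01 per hour and the ages t₀ are hypothetical values), the conditional probability of surviving an additional t hours is computed for several current ages and compared with the unconditional probability P(X ≥ t).

```python
import math


def survival(x: float, lam: float) -> float:
    """P(X >= x) = 1 - F(x; lambda) = e^{-lambda x} for x >= 0."""
    return math.exp(-lam * x)


lam = 0.01   # hypothetical failure rate per hour
t = 50.0     # additional hours of service required

# P(X >= t + t0 | X >= t0) for several current ages t0.
results = []
for t0 in (0.0, 100.0, 500.0):
    conditional = survival(t + t0, lam) / survival(t0, lam)
    results.append((t0, conditional))

unconditional = survival(t, lam)
for t0, conditional in results:
    print(f"t0 = {t0:5.0f}: conditional {conditional:.6f}, unconditional {unconditional:.6f}")
```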

Although the memoryless property can be justified at least approximately in 
many applied problems, in other situations components deteriorate with age or 
occasionally improve with age (at least up to a certain point). More general lifetime 
models are then furnished by the gamma, Weibull, and lognormal distributions 
(the latter two are discussed in the next section). Lifetime distributions are at the 
heart of reliability models, which we’ll consider in depth in Sect. 4.8. 


3.4.2 The Gamma Distribution


To define the family of gamma distributions, which generalizes the exponential 
distribution, we first need to introduce a function that plays an important role in 
many branches of mathematics. 


DEFINITION 
For α > 0, the gamma function Γ(α) is defined by

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\,dx$$


The most important properties of the gamma function are the following:
1. For any α > 1, Γ(α) = (α − 1)·Γ(α − 1) (via integration by parts)
2. For any positive integer n, Γ(n) = (n − 1)!
3. Γ(1/2) = √π
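
The three properties can be spot-checked with the gamma function in Python's standard library, `math.gamma`. This is only an illustrative sketch; the value α = 4.7 is an arbitrary choice.

```python
import math

# Property 1: Gamma(alpha) = (alpha - 1) * Gamma(alpha - 1) for alpha > 1.
alpha = 4.7   # arbitrary value greater than 1
recursion_gap = math.gamma(alpha) - (alpha - 1) * math.gamma(alpha - 1)

# Property 2: Gamma(n) = (n - 1)! for positive integers n.
factorial_pairs = [(math.gamma(n), math.factorial(n - 1)) for n in range(1, 8)]

# Property 3: Gamma(1/2) = sqrt(pi).
half_gap = math.gamma(0.5) - math.sqrt(math.pi)

print(recursion_gap, factorial_pairs, half_gap)
```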


The following proposition will prove useful for several computations that follow. 


PROPOSITION 
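For any α, β > 0,

$$\int_0^{\infty} x^{\alpha - 1} e^{-x/\beta}\,dx = \beta^{\alpha}\,\Gamma(\alpha) \tag{3.5}$$


Proof Make the substitution u = x/β, so that x = βu and dx = β du:

$$
\int_0^{\infty} x^{\alpha - 1} e^{-x/\beta}\,dx = \int_0^{\infty} (\beta u)^{\alpha - 1} e^{-u}\,\beta\,du
= \beta^{\alpha}\int_0^{\infty} u^{\alpha - 1} e^{-u}\,du = \beta^{\alpha}\,\Gamma(\alpha)
$$

The last equality comes from the definition of the gamma function.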


With the preceding proposition in mind, we make the following definition. 

DEFINITION
A continuous random variable X is said to have a gamma distribution if the
pdf of X is

$$
f(x; \alpha, \beta) = \begin{cases} \dfrac{1}{\beta^{\alpha}\Gamma(\alpha)}\,x^{\alpha - 1} e^{-x/\beta} & x \ge 0 \\ 0 & \text{otherwise} \end{cases} \tag{3.6}
$$

where the parameters α and β satisfy α > 0, β > 0. When β = 1, X is said to
have a standard gamma distribution, and its pdf may be denoted f(x; α).


The exponential distribution results from taking α = 1 and β = 1/λ.

It’s clear that f(x; α, β) ≥ 0 for all x; the previous proposition guarantees that this
function integrates to 1, as required. Figure 3.25a illustrates the graphs of the
gamma pdf for several (α, β) pairs, whereas Fig. 3.25b presents graphs of the
standard gamma pdf. For the standard pdf, when α ≤ 1, f(x; α) is strictly decreasing
as x increases; when α > 1, f(x; α) rises to a maximum and then decreases. Because
of this difference, α is referred to as a shape parameter. The parameter β in Eq. (3.6)
is called the scale parameter because values other than 1 either stretch or compress
the pdf in the x direction.

The mean and variance of a gamma distribution are 


E(X) =u = af Var(X) = 0? = af 


These can be calculated directly from the gamma pdf using integration by parts, 
or by employing properties of the gamma function along with Expression (3.5); 
see Exercise 83. Notice these are consistent with the aforementioned mean 
and variance of the exponential distribution: with a=1 and 6=1/A we obtain 
E(X) = 1(1/A) = 1/A and Var(X) = 1(1/4)? = 1/2’. 

Fig. 3.25 (a) Gamma density curves; (b) standard gamma density curves

In the special case where the shape parameter α is a positive integer, n, the
gamma distribution is sometimes rewritten with the substitution λ = 1/β, and the
resulting pdf is

\[ f(x; n, 1/\lambda) = \frac{\lambda^{n} x^{n-1} e^{-\lambda x}}{(n-1)!}, \qquad x \ge 0 \]


This is often called an Erlang distribution, and it plays a central role in the study 
of Poisson processes (again, see Sect. 7.5; notice that the n= 1 case of the Erlang 
distribution is actually the exponential pdf). In Chap. 4, it will be shown that the 
sum of n independent exponential rvs follows this Erlang distribution. 

When X is a standard gamma rv, the cdf of X, which for x > 0 is

\[ G(x;\alpha) = P(X \le x) = \int_0^x \frac{1}{\Gamma(\alpha)}\, y^{\alpha-1} e^{-y}\,dy \tag{3.7} \]

is called the incomplete gamma function (in mathematics literature, the
incomplete gamma function sometimes refers to Eq. (3.7) without the denominator Γ(α)
in the integrand). In Appendix Table A.4, we present a small tabulation of G(x; α)
for α = 1, 2, ..., 10 and x = 1, 2, ..., 15. Table 3.2 at the end of this section
provides the Matlab and R commands related to the gamma cdf, which are
illustrated in the following examples.


Example 3.27 Suppose the reaction time X (in seconds) of a randomly selected
individual to a certain stimulus has a standard gamma distribution with α = 2. Since
X is continuous,

\[ P(3 \le X \le 5) = P(X \le 5) - P(X \le 3) = G(5;2) - G(3;2) = .960 - .801 = .159 \]

This probability can be obtained in Matlab with cdf('gamma',5,2,1)-
cdf('gamma',3,2,1) and in R with pgamma(5,2)-pgamma(3,2).
The probability that the reaction time is more than 4 s is

\[ P(X > 4) = 1 - P(X \le 4) = 1 - G(4;2) = 1 - .908 = .092 \] ∎
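
The same numbers can be reproduced without tables or statistical software. For a positive integer shape n, repeated integration by parts of Eq. (3.7) gives the finite sum G(x; n) = 1 − e^{−x} Σ_{k=0}^{n−1} x^k/k!. That identity is not stated in the text and applies only to integer shapes; the function name below is hypothetical.

```python
import math

def std_gamma_cdf_int(x, n):
    """G(x; n) for a positive integer shape n, using the finite-sum identity (an assumption
    introduced here, valid only for integer n)."""
    if x <= 0:
        return 0.0
    term, total = 1.0, 1.0          # k = 0 term of sum_{k<n} x^k / k!
    for k in range(1, n):
        term *= x / k               # builds x^k / k! incrementally
        total += term
    return 1.0 - math.exp(-x) * total

p_between = std_gamma_cdf_int(5, 2) - std_gamma_cdf_int(3, 2)   # P(3 <= X <= 5)
p_over_4 = 1 - std_gamma_cdf_int(4, 2)                          # P(X > 4)
print(round(p_between, 3), round(p_over_4, 3))
```

Rounded to three decimals, the two results agree with the table-based values .159 and .092 in Example 3.27.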


The incomplete gamma function can also be used to compute probabilities 
involving nonstandard gamma distributions. 


PROPOSITION
Let X have a gamma distribution with parameters α and β. Then for any x > 0,
the cdf of X is given by

\[ F(x;\alpha,\beta) = P(X \le x) = G\!\left(\frac{x}{\beta};\alpha\right), \]

the incomplete gamma function evaluated at x/β.


The proof is similar to that of Eq. (3.5).


Example 3.28 Suppose the survival time X in weeks of a randomly selected male
mouse exposed to 240 rads of gamma radiation has, rather fittingly, a gamma
distribution with α = 8 and β = 15. (Data in Survival Distributions: Reliability
Applications in the Biomedical Services, by A. J. Gross and V. Clark, suggest
α ≈ 8.5 and β ≈ 13.3.) The expected survival time is E(X) = (8)(15) = 120 weeks,
whereas SD(X) = √((8)(15)²) = √1800 = 42.43 weeks. The probability that a
mouse survives between 60 and 120 weeks is

\[ \begin{aligned} P(60 \le X \le 120) &= P(X \le 120) - P(X \le 60) \\ &= G(120/15;\,8) - G(60/15;\,8) \\ &= G(8;8) - G(4;8) = .547 - .051 = .496 \end{aligned} \]

In Matlab, the command cdf('gamma',120,8,15)-cdf('gamma',60,8,15)
yields the desired probability; the corresponding R code is
pgamma(120,8,1/15)-pgamma(60,8,1/15).

The probability that a mouse survives at least 30 weeks is

\[ P(X \ge 30) = 1 - P(X < 30) = 1 - P(X \le 30) = 1 - G(30/15;\,8) = .999 \]
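
The rescaling step F(x; α, β) = G(x/β; α) is easy to mirror in code. The sketch below again relies on the finite-sum form of G(x; n) for integer shape n (an identity not stated in the text, obtained by repeated integration by parts); the function name is hypothetical.

```python
import math

def gamma_cdf_int(x, n, beta):
    """Gamma cdf for positive-integer shape n and scale beta: G(x/beta; n)."""
    if x <= 0:
        return 0.0
    y = x / beta                    # rescale to the standard gamma distribution
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= y / k               # y^k / k!
        total += term
    return 1.0 - math.exp(-y) * total

alpha, beta = 8, 15                 # parameters from Example 3.28
p_60_to_120 = gamma_cdf_int(120, alpha, beta) - gamma_cdf_int(60, alpha, beta)
p_at_least_30 = 1 - gamma_cdf_int(30, alpha, beta)
print(round(p_60_to_120, 3), round(p_at_least_30, 3))
```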


3.4.3 The Gamma MGF 


The integral proposition earlier in this section made it easy to determine the mean 
and variance of a gamma rv. However, the moment generating function of the 
gamma distribution — and, as a special case, of the exponential model — will 
prove critical in establishing some of the more advanced properties of these 
distributions in Chap. 4. 


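PROPOSITION
The moment generating function of a gamma random variable is

\[ M_X(t) = \frac{1}{(1-\beta t)^{\alpha}}, \qquad t < 1/\beta \]

Proof By definition, the mgf is

\[ M_X(t) = E(e^{tX}) = \int_0^\infty e^{tx}\,\frac{x^{\alpha-1}}{\Gamma(\alpha)\beta^{\alpha}}\, e^{-x/\beta}\,dx = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^\infty x^{\alpha-1} e^{-(-t+1/\beta)x}\,dx \]

Now use Expression (3.5): provided −t + 1/β > 0, i.e., t < 1/β,

\[ \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^\infty x^{\alpha-1} e^{-(-t+1/\beta)x}\,dx = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \cdot \Gamma(\alpha) \left(\frac{1}{-t+1/\beta}\right)^{\alpha} = \frac{1}{(1-\beta t)^{\alpha}} \] ∎


The exponential mgf can then be determined with the substitution α = 1, β = 1/λ:

\[ M_X(t) = \frac{1}{(1-(1/\lambda)t)^{1}} = \frac{\lambda}{\lambda - t}, \qquad t < \lambda \]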
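
Because M′(0) = E(X) and M″(0) = E(X²), the mgf formula can be sanity-checked against the stated mean αβ and variance αβ². The sketch below does this with central finite differences rather than symbolic derivatives; the step size h and the function name are assumptions made for illustration.

```python
def gamma_mgf(t, alpha, beta):
    """M_X(t) = 1 / (1 - beta*t)**alpha, valid only for t < 1/beta."""
    if t >= 1 / beta:
        raise ValueError("mgf is defined only for t < 1/beta")
    return (1 - beta * t) ** (-alpha)

alpha, beta = 8, 15      # values from Example 3.28
h = 1e-5                 # finite-difference step (a tuning assumption)

m0 = gamma_mgf(0.0, alpha, beta)
m_plus = gamma_mgf(h, alpha, beta)
m_minus = gamma_mgf(-h, alpha, beta)

mean = (m_plus - m_minus) / (2 * h)                  # approximates M'(0) = E(X)
second_moment = (m_plus - 2 * m0 + m_minus) / h**2   # approximates M''(0) = E(X^2)
variance = second_moment - mean**2
print(mean, variance)
```

The printed values should be close to αβ = 120 and αβ² = 1800, up to finite-difference error.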


3.4.4 Gamma and Exponential Calculations with Software 


Table 3.2 summarizes the syntax for gamma and exponential probability
calculations in Matlab and R, which follows the pattern of the other distributions.
In a sense, the exponential commands are redundant, since they are just a special
case (α = 1) of the gamma distribution.

Notice that Matlab and R parameterize the distributions differently: in Matlab,
both the gamma and exponential functions require β (that is, 1/λ) as the last input,
whereas the R functions take as their last input the “rate” parameter λ = 1/β. So, for
the gamma rv with parameters α = 8 and β = 15 from Example 3.28, the probability
P(X ≤ 30) would be evaluated as cdf('gamma',30,8,15) in Matlab but
pgamma(30,8,1/15) in R. This inconsistency of gamma inputs can be
remedied by using a name assignment in the last argument in R; specifically,
pgamma(30,8,scale=15) will instruct R to use β = 15 in its gamma probability
calculation and produce the same answer as the previous expressions.
Interestingly, as of this writing the same option does not exist in the pexp function.

To graph gamma or exponential distributions, one can request their pdfs
by replacing cdf with pdf (in Matlab) or the leading letter p with d (in R). To
find quantiles of either of these distributions, the appropriate replacements are
icdf and q, respectively. For example, the 75th percentile of the gamma distribution
from Example 3.28 can be determined with icdf('gamma',.75,8,15)
in Matlab or qgamma(.75,8,scale=15) in R (both give 145.2665 weeks).
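
For readers without Matlab or R, a quantile can be found by inverting the cdf numerically. The sketch below uses bisection, which works because the cdf is increasing in x. It assumes an integer shape so that the cdf has a finite-sum form (an identity not stated in the text), and the function names are hypothetical.

```python
import math

def gamma_cdf_int(x, n, beta):
    """Gamma cdf with positive-integer shape n and scale beta (finite-sum identity)."""
    if x <= 0:
        return 0.0
    y = x / beta
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= y / k
        total += term
    return 1.0 - math.exp(-y) * total

def gamma_quantile_int(p, n, beta, tol=1e-10):
    """Solve gamma_cdf_int(x, n, beta) = p for x by bisection, for 0 < p < 1."""
    lo, hi = 0.0, beta
    while gamma_cdf_int(hi, n, beta) < p:   # expand until the root is bracketed
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gamma_cdf_int(mid, n, beta) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

q75 = gamma_quantile_int(0.75, 8, 15)       # 75th percentile for Example 3.28
print(round(q75, 4))
```

The result can be compared with the value 145.2665 weeks quoted above for the Matlab and R commands.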


Table 3.2 Matlab and R code for gamma and exponential calculations


            Gamma                  Exponential
Function:   cdf                    cdf
Notation:   G(x/β; α)              F(x; λ) = 1 − e^(−λx)
Matlab:     cdf('gamma',x,α,β)     cdf('exp',x,1/λ)
R:          pgamma(x,α,1/β)        pexp(x,λ)


3.4.5 Exercises: Section 3.4 (71-83) 


71. Let X = the time between two successive arrivals at the drive-up window
of a local bank. If X has an exponential distribution with λ = 1, compute the
following:
(a) The expected time between two successive arrivals
(b) The standard deviation of the time between successive arrivals
(c) P(X ≤ 4)
(d) P(2 ≤ X ≤ 5)

72. Let X denote the distance (m) that an animal moves from its birth site to
the first territorial vacancy it encounters. Suppose that for banner-tailed
kangaroo rats, X has an exponential distribution with parameter λ = .01386
(as suggested in the article “Competition and Dispersal from Multiple Nests,”
Ecology, 1997: 873-883).
(a) What is the probability that the distance is at most 100 m? At most 200 m?
Between 100 and 200 m?
(b) What is the probability that distance exceeds the mean distance by more
than 2 standard deviations?
(c) What is the value of the median distance?

73. In studies of anticancer drugs it was found that if mice are injected with cancer
cells, the survival time can be modeled with the exponential distribution.
Without treatment the expected survival time was 10 h. What is the probability
that
(a) A randomly selected mouse will survive at least 8 h? At most 12 h?
Between 8 and 12 h?
(b) The survival time of a mouse exceeds the mean value by more than
2 standard deviations? More than 3 standard deviations?

74. Data collected at Toronto Pearson International Airport suggests that
an exponential distribution with mean value 2.725 h is a good model for
rainfall duration (Urban Stormwater Management Planning with Analytical
Probabilistic Models, 2000, p. 69).
(a) What is the probability that the duration of a particular rainfall event at
this location is at least 2 h? At most 3 h? Between 2 and 3 h?
(b) What is the probability that rainfall duration exceeds the mean value
by more than 2 standard deviations? What is the probability that it is
less than the mean value by more than one standard deviation?

75. Evaluate the following:
(a) Γ(6)
(b) Γ(5/2)
(c) G(4; 5) (the incomplete gamma function)
(d) G(5; 4)
(e) G(0; 4)

76. Let X have a standard gamma distribution with α = 7. Evaluate the following:
(a) P(X ≤ 5)
(b) P(X < 5)
(c) P(X > 8)
(d) P(3 ≤ X ≤ 8)
(e) P(3 < X < 8)
(f) P(X < 4 or X > 6)

77. Suppose that when a type of transistor is subjected to an accelerated life test,
the lifetime X (in weeks) has a gamma distribution with mean 24 weeks and
standard deviation 12 weeks.
(a) What is the probability that a transistor will last between 12 and 24 weeks?
(b) What is the probability that a transistor will last at most 24 weeks? Is the
median of the lifetime distribution less than 24? Why or why not?
(c) What is the 99th percentile of the lifetime distribution?
(d) Suppose the test will actually be terminated after t weeks. What value of
t is such that only .5% of all transistors would still be operating at
termination?

78. The two-parameter gamma distribution can be generalized by introducing a
third parameter γ, called a threshold or location parameter: replace x in
Eq. (3.6) by x − γ and x ≥ 0 by x ≥ γ. This amounts to shifting the density
curves in Fig. 3.25 so that they begin their ascent or descent at γ rather
than 0. The article “Bivariate Flood Frequency Analysis with Historical
Information Based on Copulas” (J. of Hydrologic Engr., 2013: 1018-1030)
employs this distribution to model X = 3-day flood volume (10° m³). Suppose
that values of the parameters are α = 12, β = 7, γ = 40 (very close to estimates
in the cited article based on past data).
(a) What are the mean value and standard deviation of X?
(b) What is the probability that flood volume is between 100 and 150?
(c) What is the probability that flood volume exceeds its mean value by more
than one standard deviation?
(d) What is the 95th percentile of the flood volume distribution?

79. If X has an exponential distribution with parameter λ, derive a general
expression for the (100p)th percentile of the distribution. Then specialize to
obtain the median.

80. A system consists of five identical components connected in series as shown:

[Diagram: components 1, 2, 3, 4, 5 connected in series]

As soon as one component fails, the entire system will fail. Suppose each
component has a lifetime that is exponentially distributed with λ = .01 and
that components fail independently of one another. Define events A_i = {ith
component lasts at least t hours}, i = 1, ..., 5, so that the A_i's are independent
events. Let X = the time at which the system fails, that is, the shortest
(minimum) lifetime among the five components.
(a) The event {X ≥ t} is equivalent to what event involving A_1, ..., A_5?
(b) Using the independence of the five A_i's, compute P(X ≥ t). Then obtain
F(t) = P(X ≤ t) and the pdf of X. What type of distribution does X have?
(c) Suppose there are n components, each having exponential lifetime with
parameter λ. What type of distribution does X have?

81. Based on an analysis of sample data, the article “Pedestrians’ Crossing
Behaviors and Safety at Unmarked Roadways in China” (Accident Analysis
and Prevention, 2011: 1927-1936) proposed the pdf f(x) = .15e^(−.15(x−1)) when
x ≥ 1 as a model for the distribution of X = time (sec) spent at the median line.
This is an example of a shifted exponential distribution, i.e., an exponential
model beginning at an x-value other than 0.
(a) What is the probability that waiting time is at most 5 s? More than 5 s?
(b) What is the probability that waiting time is between 2 and 5 s?
(c) What is the mean waiting time?
(d) What is the standard deviation of waiting times?
[Hint: For (c) and (d), you can either use integration or write X = Y + 1,
where Y has an exponential distribution with parameter λ = .15. Then, apply
rescaling properties of mean and standard deviation.]

82. The double exponential distribution has pdf

\[ f(x) = \frac{\lambda}{2}\, e^{-\lambda |x|} \qquad \text{for } -\infty < x < \infty \]

The article “Microwave Observations of Daily Antarctic Sea-Ice Edge
Expansion and Contribution Rates” (IEEE Geosci. and Remote Sensing
Letters, 2006: 54-58) states that “the distribution of the daily sea-ice
advance/retreat from each sensor is similar and is approximately double
exponential.” The standard deviation is given as 40.9 km.
(a) What is the mean of the double exponential distribution? [Hint: Draw a
picture of the density curve.]
(b) What is the value of the parameter λ?
(c) What is the probability that the extent of daily sea-ice change is within
1 standard deviation of the mean value?

83. (a) Find the mean and variance of the gamma distribution using integration
and Expression (3.5) to obtain E(X) and E(X²).
(b) Use the gamma mgf to find the mean and variance.


3.5 Other Continuous Distributions 


The normal, gamma (including exponential), and uniform families of distributions 
provide a wide variety of probability models for continuous variables, but there are 
many practical situations in which no member of these families fits a set of observed 
data very well. Statisticians and other investigators have developed other families 
of distributions that are often appropriate in practice. 


3.5.1 The Weibull Distribution 


The family of Weibull distributions was introduced by the Swedish physicist 
Waloddi Weibull in 1939; his 1951 article “A Statistical Distribution Function of 
Wide Applicability” (J. Appl. Mech., 18: 293-297) discusses a number of 
applications. 


DEFINITION
A random variable X is said to have a Weibull distribution with parameters α
and β (α > 0, β > 0) if the pdf of X is

\[ f(x;\alpha,\beta) = \begin{cases} \dfrac{\alpha}{\beta^{\alpha}}\, x^{\alpha-1} e^{-(x/\beta)^{\alpha}} & x \ge 0 \\[1ex] 0 & x < 0 \end{cases} \tag{3.8} \]


In some situations there are theoretical justifications for the appropriateness
of the Weibull distribution, but in many applications f(x; α, β) simply provides
a good fit to observed data for particular values of α and β. When α = 1, the
pdf reduces to the exponential distribution (with λ = 1/β), so the exponential
distribution is a special case of both the gamma and Weibull distributions.
However, there are gamma distributions that are not Weibull distributions and
vice versa, so one family is not a subset of the other. Both α and β can be varied
to obtain a number of different distributional shapes, as illustrated in Fig. 3.26.
Note that β is a scale parameter, so different values stretch or compress the graph
in the x-direction; α is referred to as a shape parameter.

Integrating to obtain E(X) and E(X²) yields

\[ \mu = \beta\,\Gamma\!\left(1+\frac{1}{\alpha}\right) \qquad \sigma^2 = \beta^2\left\{\Gamma\!\left(1+\frac{2}{\alpha}\right) - \left[\Gamma\!\left(1+\frac{1}{\alpha}\right)\right]^2\right\} \]

The computation of μ and σ² thus necessitates using the gamma function from
Sect. 3.4. (The moment generating function of the Weibull distribution is
very complicated, and so we do not include it here.) On the other hand, the
integration ∫₀ˣ f(y; α, β) dy is easily carried out to obtain the cdf of X:

\[ F(x;\alpha,\beta) = \begin{cases} 0 & x < 0 \\ 1 - e^{-(x/\beta)^{\alpha}} & x \ge 0 \end{cases} \tag{3.9} \]
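
The mean and variance formulas translate directly into code once a gamma-function routine is available. The following is a minimal sketch using Python's `math.gamma`; the parameter values α = 2, β = 10 and the function name are illustrative assumptions.

```python
import math

def weibull_mean_var(alpha, beta):
    """Mean and variance of a Weibull(alpha, beta) rv from the gamma-function formulas."""
    g1 = math.gamma(1 + 1 / alpha)
    g2 = math.gamma(1 + 2 / alpha)
    return beta * g1, beta**2 * (g2 - g1**2)

mu, var = weibull_mean_var(2, 10)          # illustrative parameter values
print(round(mu, 3), round(var, 3), round(math.sqrt(var), 3))

# Sanity check: alpha = 1 is the exponential case, with mean beta and variance beta^2.
mu_exp, var_exp = weibull_mean_var(1, 4)
print(mu_exp, var_exp)
```

The α = 1 check ties the formulas back to the exponential special case noted above.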


Example 3.29 In recent years the Weibull distribution has been used to model
engine emissions of various pollutants. Let X denote the amount of NOₓ emission
(g/gal) from a randomly selected four-stroke engine of a certain type, and suppose
that X has a Weibull distribution with α = 2 and β = 10 (suggested by information in
the article “Quantification of Variability and Uncertainty in Lawn and Garden
Equipment NOₓ and Total Hydrocarbon Emission Factors,” J. Air Waste Manag.
Assoc., 2002: 435-448). The corresponding density curve looks exactly like the one
in Fig. 3.26 for α = 2, β = 1 except that now the values 50 and 100 replace 5 and
10 on the horizontal axis (because β is a “scale parameter”). Then

\[ P(X \le 10) = F(10; 2, 10) = 1 - e^{-(10/10)^2} = 1 - e^{-1} = .632 \]

Similarly, P(X ≤ 25) = .998, so the distribution is almost entirely concentrated on
values between 0 g/gal and 25 g/gal. The value c which separates the 5% of all engines
having the largest amounts of NOₓ emissions from the remaining 95% satisfies

\[ .95 = F(c; 2, 10) = 1 - e^{-(c/10)^2} \]

Isolating the exponential term on one side, taking logarithms, and solving the
resulting equation gives c ≈ 17.3 g/gal as the 95th percentile of the emission
distribution. ∎


Fig. 3.26 Weibull density curves f(x; α, β), including the case α = 1, β = 1 (exponential)
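
The calculations in Example 3.29 need only the closed-form cdf in Eq. (3.9) and its inverse. The sketch below codes both; the function names are hypothetical, and the inverse simply repeats the "isolate the exponential, take logarithms" step from the example.

```python
import math

def weibull_cdf(x, alpha, beta):
    """Eq. (3.9): F(x) = 1 - exp(-(x/beta)**alpha) for x >= 0, and 0 for x < 0."""
    return 0.0 if x < 0 else 1.0 - math.exp(-((x / beta) ** alpha))

def weibull_percentile(p, alpha, beta):
    """Solve p = F(c) in Eq. (3.9) for c, assuming 0 < p < 1."""
    return beta * (-math.log(1.0 - p)) ** (1.0 / alpha)

p10 = weibull_cdf(10, 2, 10)               # P(X <= 10)
p25 = weibull_cdf(25, 2, 10)               # P(X <= 25)
c95 = weibull_percentile(0.95, 2, 10)      # 95th percentile
print(round(p10, 3), round(p25, 3), round(c95, 1))
```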


Frequently, in practical situations, a Weibull model may be reasonable except
that the smallest possible X value may be some value γ not assumed to be zero (this
would also apply to a gamma model). The quantity γ can then be regarded as a third
parameter of the distribution, which is what Weibull did in his original work. For,
say, γ = 3, all curves in Fig. 3.26 would be shifted 3 units to the right. This is
equivalent to saying that X − γ has the pdf Eq. (3.8), so that the cdf of X is obtained
by replacing x in Eq. (3.9) by x − γ.


Example 3.30 An understanding of the volumetric properties of asphalt is
important in designing mixtures that will result in high-durability pavement. The
article “Is a Normal Distribution the Most Appropriate Statistical Distribution for
Volumetric Properties in Asphalt Mixtures” (J. of Testing and Evaluation, Sept.
2009: 1-11) used the analysis of some sample data to recommend that for a
particular mixture, X = air void volume (%) be modeled with a three-parameter
Weibull distribution. Suppose the values of the parameters are γ = 4, α = 1.3, and
β = .8, which are quite close to estimates given in the article.

For x > 4, the cumulative distribution function is

\[ F(x;\alpha,\beta,\gamma) = F(x; 1.3, .8, 4) = 1 - e^{-[(x-4)/.8]^{1.3}} \]

The probability that the air void volume of a specimen is between 5% and 6% is

\[ \begin{aligned} P(5 \le X \le 6) &= F(6; 1.3, .8, 4) - F(5; 1.3, .8, 4) = e^{-[(5-4)/.8]^{1.3}} - e^{-[(6-4)/.8]^{1.3}} \\ &= .263 - .037 = .226 \end{aligned} \]
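
The three-parameter cdf is the two-parameter cdf evaluated at x − γ, which the following sketch makes explicit. The function name `weibull3_cdf` and the argument name `threshold` (standing for γ) are hypothetical.

```python
import math

def weibull3_cdf(x, alpha, beta, threshold):
    """Three-parameter Weibull cdf: Eq. (3.9) with x replaced by x - threshold."""
    if x < threshold:
        return 0.0
    return 1.0 - math.exp(-(((x - threshold) / beta) ** alpha))

# Parameters from Example 3.30: shape 1.3, scale .8, threshold 4.
p_5_to_6 = weibull3_cdf(6, 1.3, 0.8, 4) - weibull3_cdf(5, 1.3, 0.8, 4)
print(round(p_5_to_6, 3))
```

Working in full precision rather than with the rounded terms .263 and .037 can shift the last digit slightly.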


Figure 3.27 shows a graph of the corresponding Weibull density function, in
which the shaded area corresponds to the probability just calculated. ∎

Fig. 3.27 Weibull density curve with threshold = 4, shape = 1.3, scale = .8


3.5.2 The Lognormal Distribution 


Lognormal distributions have been used extensively in engineering, medicine, and 
more recently, finance. 


DEFINITION
A nonnegative rv X is said to have a lognormal distribution if the rv
Y = ln(X) has a normal distribution. The resulting pdf of a lognormal rv
when ln(X) is normally distributed with parameters μ and σ is

\[ f(x;\mu,\sigma) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma x}\, e^{-[\ln(x)-\mu]^2/(2\sigma^2)} & x > 0 \\[1ex] 0 & x \le 0 \end{cases} \]


Fig. 3.28 Lognormal density curves


Be careful here: the parameters μ and σ are not the mean and standard deviation
of X but of ln(X). The mean and variance of a lognormal random variable can be
shown to be

\[ E(X) = e^{\mu+\sigma^2/2} \qquad \mathrm{Var}(X) = e^{2\mu+\sigma^2}\cdot\left(e^{\sigma^2}-1\right) \]


In Chap. 4, we will present a theoretical justification for this distribution in 
connection with the Central Limit Theorem, but as with other distributions, the 
lognormal can be used as a model even in the absence of such justification. 
Figure 3.28 illustrates graphs of the lognormal pdf; although a normal curve is 
symmetric, a lognormal curve has a positive skew. 

Because ln(X) has a normal distribution, the cdf of X can be expressed in terms
of the cdf Φ(z) of a standard normal rv Z. For x > 0,

\[ F(x;\mu,\sigma) = P(X \le x) = P[\ln(X) \le \ln(x)] = P\!\left(Z \le \frac{\ln(x)-\mu}{\sigma}\right) = \Phi\!\left(\frac{\ln(x)-\mu}{\sigma}\right) \tag{3.10} \]

Differentiating F(x; μ, σ) with respect to x gives the pdf f(x; μ, σ) above.


Example 3.31 According to the article “Predictive Model for Pitting Corrosion in
Buried Oil and Gas Pipelines” (Corrosion, 2009: 332-342), the lognormal distribution
has been reported as the best option for describing the distribution of maximum
pit depth data from cast iron pipes in soil. The authors suggest that a lognormal
distribution with μ = .353 and σ = .754 is appropriate for maximum pit depth
(mm) of buried pipelines. For this distribution, the mean value and variance of pit
depth are

\[ E(X) = e^{.353+(.754)^2/2} = e^{.6373} = 1.891 \]
\[ \mathrm{Var}(X) = e^{2(.353)+(.754)^2}\cdot\left(e^{(.754)^2}-1\right) = (3.57697)(.765645) = 2.7387 \]

The probability that maximum pit depth is between 1 and 2 mm is

\[ \begin{aligned} P(1 \le X \le 2) &= P(\ln(1) \le \ln(X) \le \ln(2)) = P(0 \le \ln(X) \le .693) \\ &= P\!\left(\frac{0-.353}{.754} \le Z \le \frac{.693-.353}{.754}\right) = \Phi(.45) - \Phi(-.47) = .354 \end{aligned} \]

This probability is illustrated in Fig. 3.29.

Fig. 3.29 Lognormal density curve with μ = .353 and σ = .754 (shaded area = .354)

What value c is such that only 1% of all specimens have a maximum pit depth
exceeding c? The desired value satisfies

\[ .99 = P(X \le c) = \Phi\!\left(\frac{\ln(c)-.353}{.754}\right) \]

Appendix Table A.3 indicates that z = 2.33 is the 99th percentile of the standard
normal distribution, which implies that

\[ \frac{\ln(c)-.353}{.754} = 2.33 \]

Solving for c gives ln(c) = 2.1098 and c = 8.247. Thus 8.247 mm is the 99th
percentile of the maximum pit depth distribution. ∎
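
The quantities in Example 3.31 can be recomputed with the standard normal cdf in place of Appendix Table A.3. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function and variable names are hypothetical. Since it uses the unrounded 99th normal percentile rather than z = 2.33, the percentile it returns is expected to come out slightly below 8.247.

```python
import math
from statistics import NormalDist

mu, sigma = 0.353, 0.754             # parameters of ln(X) from Example 3.31
Z = NormalDist()                     # standard normal distribution

def lognormal_cdf(x):
    """Eq. (3.10): F(x) = Phi((ln(x) - mu) / sigma) for x > 0."""
    return Z.cdf((math.log(x) - mu) / sigma) if x > 0 else 0.0

mean = math.exp(mu + sigma**2 / 2)
variance = math.exp(2 * mu + sigma**2) * (math.exp(sigma**2) - 1)
p_1_to_2 = lognormal_cdf(2) - lognormal_cdf(1)
c99 = math.exp(mu + sigma * Z.inv_cdf(0.99))   # invert Eq. (3.10) at p = .99

print(round(mean, 3), round(variance, 4), round(p_1_to_2, 3), round(c99, 3))
```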


As with the Weibull distribution, a third parameter γ can be introduced so that
the domain of the distribution is x > γ rather than x > 0.


3.5.3 The Beta Distribution


All families of continuous distributions discussed so far except for the uniform 
distribution have positive density over an infinite interval (although typically the 
density function decreases rapidly to zero beyond a few standard deviations from 
the mean). The beta distribution provides positive density only for X in an interval 
of finite length. 


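DEFINITION
A random variable X is said to have a beta distribution with parameters α, β
(both positive), A, and B if the pdf of X is

\[ f(x;\alpha,\beta,A,B) = \begin{cases} \dfrac{1}{B-A}\cdot\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \left(\dfrac{x-A}{B-A}\right)^{\alpha-1} \left(\dfrac{B-x}{B-A}\right)^{\beta-1} & A \le x \le B \\[1ex] 0 & \text{otherwise} \end{cases} \]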


The case A=0, B=1 gives the standard beta distribution. 


Figure 3.30 illustrates several standard beta pdfs. Graphs of the general pdf
are similar, except they are shifted and then stretched or compressed to fit over
[A, B]. Unless α and β are integers, integration of the pdf to calculate probabilities is
difficult, so either a table of the incomplete beta function or software is generally
used.

The standard beta distribution is commonly used to model variation in the 
proportion or percentage of a quantity occurring in different samples, such as the 
proportion of a 24-h day that an individual is asleep or the proportion of a certain 
element in a chemical compound. 


Fig. 3.30 Standard beta density curves

The mean and variance of X are

\[ \mu = A + (B-A)\cdot\frac{\alpha}{\alpha+\beta} \qquad \sigma^2 = \frac{(B-A)^2\,\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \]


The moment generating function of the beta distribution is too complicated 
to be useful. 


Example 3.32 Project managers often use a method labeled PERT—for program 
evaluation and review technique—to coordinate the various activities making up a 
large project. (One successful application was in the construction of the Apollo 
spacecraft.) A standard assumption in PERT analysis is that the time necessary to 
complete any particular activity once it has been started has a beta distribution with 
A=the optimistic time (if everything goes well) and B=the pessimistic time 
(if everything goes badly). Suppose that in constructing a single-family house, the 
time X (in days) necessary for laying the foundation has a beta distribution with 
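A = 2, B = 5, α = 2, and β = 3. Then α/(α + β) = .4, so E(X) = 2 + (3)(.4) = 3.2. For
these values of α and β, the pdf of X is a simple polynomial function. The
probability that it takes at most 3 days to lay the foundation is

\[ P(X \le 3) = \int_2^3 \frac{1}{3}\cdot\frac{4!}{1!\,2!} \left(\frac{x-2}{3}\right) \left(\frac{5-x}{3}\right)^2 dx = \frac{4}{27}\int_2^3 (x-2)(5-x)^2\,dx = \frac{4}{27}\cdot\frac{11}{4} = \frac{11}{27} = .407 \] ∎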
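
Beta probabilities such as the one in Example 3.32 can be approximated by integrating the pdf numerically. The sketch below codes the general beta pdf from the definition and applies a hand-written composite Simpson's rule; the helper names are hypothetical, and the endpoint handling assumes α ≥ 1 and β ≥ 1 so that the pdf is finite at A and B.

```python
import math

def beta_pdf(x, alpha, beta, A=0.0, B=1.0):
    """General beta pdf on [A, B]; assumes alpha >= 1 and beta >= 1 at the endpoints."""
    if not (A <= x <= B):
        return 0.0
    const = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    u = (x - A) / (B - A)            # (B - x)/(B - A) equals 1 - u
    return const * u ** (alpha - 1) * (1 - u) ** (beta - 1) / (B - A)

def simpson(f, lo, hi, n=1000):
    """Composite Simpson's rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3

# Example 3.32: A = 2, B = 5, alpha = 2, beta = 3.
p_at_most_3 = simpson(lambda x: beta_pdf(x, 2, 3, A=2, B=5), 2, 3)
mean = 2 + (5 - 2) * 2 / (2 + 3)
print(round(p_at_most_3, 3), mean)
```

Here the pdf is a cubic polynomial, for which Simpson's rule is exact up to rounding; for non-integer α and β the same code gives only an approximation.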


Software, including Matlab and R, can be used to perform probability 
calculations for the Weibull, lognormal, and beta distributions. Interested readers 
should consult the help menus in those packages. 


3.5.4 Exercises: Section 3.5 (84-100) 


84. The lifetime X (in hundreds of hours) of a type of vacuum tube has a Weibull
distribution with parameters α = 2 and β = 3. Compute the following:
(a) E(X) and Var(X)
(b) P(X ≤ 6)
(c) P(1.5 ≤ X ≤ 6)
(This Weibull distribution is suggested as a model for time in service in
“On the Assessment of Equipment Reliability: Trading Data Collection Costs
for Precision,” J. Engrg. Manuf., 1991: 105-109.)

85. The authors of the article “A Probabilistic Insulation Life Model for Combined
Thermal-Electrical Stresses” (IEEE Trans. Electr. Insul., 1985: 519-
522) state that “the Weibull distribution is widely used in statistical problems
relating to aging of solid insulating materials subjected to aging and stress.”
They propose the use of the distribution as a model for time (in hours) to
failure of solid insulating specimens subjected to ac voltage. The values of the
parameters depend on the voltage and temperature; suppose α = 2.5 and
β = 200 (values suggested by data in the article).
(a) What is the probability that a specimen’s lifetime is at most 250? Less
than 250? More than 300?
(b) What is the probability that a specimen’s lifetime is between 100 and 250?
(c) What value is such that exactly 50% of all specimens have lifetimes
exceeding that value?

86. Let X = the time (in 10⁻¹ weeks) from shipment of a defective product until
the customer returns the product. Suppose that the minimum return time is
γ = 3.5 and that the excess X − 3.5 over the minimum has a Weibull distribution
with parameters α = 2 and β = 1.5 (see the article “Practical Applications
of the Weibull Distribution,” Indust. Qual. Control, 1964: 71-78).
(a) What is the cdf of X?
(b) What are the expected return time and variance of return time? [Hint: First
obtain E(X − 3.5) and Var(X − 3.5).]
(c) Compute P(X > 5).
(d) Compute P(5 ≤ X ≤ 8).

87. Let X have a Weibull distribution with the pdf from Expression (3.8). Verify
that μ = βΓ(1 + 1/α). [Hint: In the integral for E(X), make the change of
variable y = (x/β)^α, so that x = βy^(1/α).]

88. (a) In Exercise 84, what is the median lifetime of such tubes? [Hint: Use
Expression (3.9).]
(b) In Exercise 86, what is the median return time?
(c) If X has a Weibull distribution with the cdf from Expression (3.9), obtain a
general expression for the (100p)th percentile of the distribution.
(d) In Exercise 86, the company wants to refuse to accept returns after
t weeks. For what value of t will only 10% of all returns be refused?

89. Let X denote the ultimate tensile strength (ksi) at −200° of a randomly
selected steel specimen of a certain type that exhibits “cold brittleness” at
low temperatures. Suppose that X has a Weibull distribution with α = 20 and
β = 100.
(a) What is the probability that X is at most 105 ksi?
(b) If specimen after specimen is selected, what is the long-run proportion
having strength values between 100 and 105 ksi?
(c) What is the median of the strength distribution?

90. The article “On Assessing the Accuracy of Offshore Wind Turbine
Reliability-Based Design Loads from the Environmental Contour Method”
(Intl. J. of Offshore and Polar Engr., 2005: 132-140) proposes the Weibull
distribution with α = 1.817 and β = .863 as a model for 1-h significant wave
height (m) at a certain site.
(a) What is the probability that wave height is at most .5 m?
(b) What is the probability that wave height exceeds its mean value by more
than one standard deviation?
(c) What is the median of the wave-height distribution?
(d) For 0 < p < 1, give a general expression for the 100pth percentile of the
wave-height distribution.

91. Nonpoint source loads are chemical masses that travel to the main stem of a
river and its tributaries in flows that are distributed over relatively long stream
reaches, in contrast to those that enter at well-defined and regulated points.
The article “Assessing Uncertainty in Mass Balance Calculation of River
Nonpoint Source Loads” (J. of Envir. Engr., 2008: 247-258) suggested that
for a certain time period and location, nonpoint source load of total dissolved
solids could be modeled with a lognormal distribution having mean value
10,281 kg/day/km and a coefficient of variation CV = .40 (CV = σ_X/μ_X).
(a) What are the mean value and standard deviation of ln(X)?
(b) What is the probability that X is at most 15,000 kg/day/km?
(c) What is the probability that X exceeds its mean value, and why is this
probability not .5?
(d) Is 17,000 the 95th percentile of the distribution?

92. The authors of the article “Study on the Life Distribution of Microdrills” (J. of
Engr. Manufacture, 2002: 301-305) suggested that a reasonable probability
model for drill lifetime was a lognormal distribution with μ = 4.5 and σ = .8.
(a) What are the mean value and standard deviation of lifetime?
(b) What is the probability that lifetime is at most 100?
(c) What is the probability that lifetime is at least 200? Greater than 200?

93. Use Equation (3.10) to write a formula for the median η of the lognormal
distribution. What is the median for the load distribution of Exercise 91?

94. As in the case of the Weibull distribution, the lognormal distribution can be
modified by the introduction of a third parameter γ such that the pdf is shifted
to be positive only for x > γ. The article cited in Exercise 46 suggested that a
shifted lognormal distribution with shift = 1.0, mean value = 2.16, and standard
deviation = 1.03 would be an appropriate model for the rv X = maximum-
to-average depth ratio of a corrosion defect in pressurized steel.
(a) What are the values of μ and σ for the proposed distribution?
(b) What is the probability that depth ratio exceeds 2?
(c) What is the median of the depth ratio distribution?
(d) What is the 99th percentile of the depth ratio distribution?

95. Sales delay is the elapsed time between the manufacture of a product and its
sale. According to the article “Warranty Claims Data Analysis Considering
Sales Delay” (Quality and Reliability Engr. Intl., 2013: 113-123), it is quite
common for investigators to model sales delay using a lognormal distribution.
For a particular product, the cited article proposes this distribution with
parameter values μ = 2.05 and σ² = .06 (here the unit for delay is months).
(a) What are the variance and standard deviation of delay time?
(b) What is the probability that delay time exceeds 12 months?
(c) What is the probability that delay time is within one standard deviation of
its mean value?
(d) What is the median of the delay time distribution?
(e) What is the 99th percentile of the delay time distribution?
(f) Among 10 randomly selected such items, how many would you expect to
have a delay time exceeding 8 months?

96. The article “The Statistics of Phytotoxic Air Pollutants” (J. Roy. Statist Soc., 
1989: 183-198) suggests the lognormal distribution as a model for SO, 
concentration above a forest. Suppose the parameter values are w= 1.9 and 
o=.9. 

(a) What are the mean value and standard deviation of concentration? 

(b) What is the probability that concentration is at most 10? Between 5 and 10? 

97. What condition on α and β is necessary for the standard beta pdf to be
symmetric?

98. Suppose the proportion X of surface area in a randomly selected quadrat that is
covered by a certain plant has a standard beta distribution with α = 5 and
β = 2.

(a) Compute E(X) and Var(X).

(b) Compute P(X ≤ .2).

(c) Compute P(.2 ≤ X ≤ .4).

(d) What is the expected proportion of the sampling region not covered by the
plant?

99. Let X have a standard beta density with parameters α and β.

(a) Verify the formula for E(X) given in the section.

(b) Compute E[(1 − X)^m]. If X represents the proportion of a substance
consisting of a particular ingredient, what is the expected proportion
that does not consist of this ingredient?

100. Stress is applied to a 20-in. steel bar that is clamped in a fixed position at each 
end. Let Y= the distance from the left end at which the bar snaps. Suppose 
Y/20 has a standard beta distribution with E(Y) = 10 and Var(Y) = 100/7. 
(a) What are the parameters of the relevant standard beta distribution? 

(b) Compute P(8 < Y < 12). 

(c) Compute the probability that the bar snaps more than 2 in. from where you 
expect it to snap. 


3.6 Probability Plots 


An investigator will often have obtained a numerical sample consisting of 
n observations and wish to know whether it is plausible that this sample came 
from a population distribution of some particular type (e.g., from a normal distri- 
bution). For one thing, many formal procedures from statistical inference (Chap. 5) 
are based on the assumption that the population distribution is of a specified type. 
The use of such a procedure is inappropriate if the actual underlying probability 
distribution differs greatly from the assumed type. Additionally, understanding the 
underlying distribution can sometimes give insight into the physical mechanisms 
involved in generating the data. An effective way to check a distributional assump- 
tion is to construct what is called a probability plot. The basis for our construction 
is a comparison between percentiles of the sample data and the corresponding 
percentiles of the assumed underlying distribution.


3.6.1 Sample Percentiles


The details involved in constructing probability plots differ a bit from source to 
source. Roughly speaking, sample percentiles are defined in the same way that 
percentiles of a population distribution are defined. The sample 50th percentile (i.e., 
the sample median) should separate the smallest 50% of the sample from the largest 
50%, the sample 90th percentile should be such that 90% of the sample lies below 
that value and 10% lies above, and so on. Unfortunately, we run into problems when 
we actually try to compute the sample percentiles for a particular sample of 
n observations. If, for example, n= 10, we can split off 20% or 30% of the data, 
but there is no value that will split off exactly 23% of these ten observations. To 
proceed further, we need an operational definition of sample percentiles (this is one 
place where different people and different software packages do slightly different 
things). 

Statistical convention states that when n is odd, the sample median is the middle 
value in the ordered list of sample observations, for example, the sixth-largest value 
when n= 11. This amounts to regarding the middle observation as being half in the 
lower half of the data and half in the upper half. Similarly, suppose n = 10. Then if 
we call the third-smallest value the 25th percentile, we are regarding that value as 
being half in the lower group (consisting of the two smallest observations) and half 
in the upper group (the seven largest observations). This leads to the following 
general definition of sample percentiles. 


DEFINITION 
Order the n sample observations from smallest to largest. Then the
ith-smallest observation in the list is taken to be the [100(i − .5)/n]th sample
percentile.


For example, if n = 10, the percentages corresponding to the ordered sample
observations are 100(1 − .5)/10 = 5%, 100(2 − .5)/10 = 15%, 25%, ..., and
100(10 − .5)/10 = 95%. All other percentiles could then be determined by
interpolation; e.g., the sample 10th percentile would be halfway between the 5th
percentile (smallest sample observation) and the 15th percentile (second smallest
observation) of the n = 10 values. For the purposes of a probability plot, such
interpolation will not be necessary, because a probability plot will be based only
on the percentages 100(i − .5)/n corresponding to the n sample observations.
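The definition can be turned into a few lines of code. The following Python sketch is only illustrative: the function names and the ten data values are hypothetical, and the linear interpolation rule is one reasonable reading of "determined by interpolation" (other software may interpolate differently).

```python
def plotting_percentages(n):
    """Percentages 100(i - .5)/n attached to the ordered observations, i = 1..n."""
    return [100 * (i - 0.5) / n for i in range(1, n + 1)]


def sample_percentile(data, pct):
    """Sample percentile by linear interpolation between ordered observations.

    The ith-smallest observation is treated as the 100(i - .5)/n th percentile.
    Percentages below the first or above the last such value are not handled.
    """
    xs = sorted(data)
    n = len(xs)
    pos = pct * n / 100 + 0.5          # fractional rank i solving 100(i - .5)/n = pct
    if pos < 1 or pos > n:
        raise ValueError("percentage outside the range covered by the sample")
    lower = int(pos)                   # rank of the observation at or just below pos
    frac = pos - lower
    if frac == 0:
        return xs[lower - 1]
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])


# Hypothetical sample of n = 10 values, used only to exercise the functions.
sample = [3, 7, 8, 12, 13, 14, 18, 21, 25, 30]
print(plotting_percentages(len(sample)))
print(sample_percentile(sample, 10))   # halfway between the two smallest values
```

With this convention the 50th percentile of an even-sized sample is the average of the two middle observations, which agrees with the usual definition of the sample median.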


3.6.2 A Probability Plot 


We now wish to determine whether the sample data could plausibly have come
from some particular population distribution (e.g., a normal distribution with μ = 10
and σ = 3). If the sample was actually selected from the specified distribution, the
sample percentiles (ordered sample observations) should be reasonably close to the
corresponding population distribution percentiles. That is, for i = 1, 2, ..., n there
should be reasonable agreement between the ith-smallest sample observation and
the theoretical [100(i − .5)/n]th percentile for the specified distribution. Consider
the (sample percentile, population percentile) pairs, that is, the pairs

    (ith-smallest sample observation, [100(i − .5)/n]th percentile of the population distribution)

for i = 1, ..., n. Each such pair can be plotted as a point on a two-dimensional
coordinate system. If the sample percentiles are close to the corresponding popula- 
tion distribution percentiles, the first number in each pair will be roughly equal to 
the second number, and the plotted points will then fall close to a 45° line. 
Substantial deviations of the plotted points from a 45° line suggest that the assumed 
distribution might be wrong. 


Example 3.33 The value of a physical constant is known to an experimenter.
The experimenter makes n = 10 independent measurements of this value using
a measurement device and records the resulting measurement errors (error =
observed value − true value). These observations appear in the accompanying
table. Is it plausible that the random variable measurement error has a standard
normal distribution? The needed standard normal (z) percentiles are also displayed
in the table and were determined as follows: the 5th percentile of the distribution
under consideration, N(0, 1), is given by Φ(z) = .05. From software or Appendix
Table A.3, the solution is roughly z = −1.645. The other nine population (z)
percentiles were found in a similar fashion.


Percentage              5       15       25       35       45
Sample observation   −1.91    −1.25     −.75     −.53      .20
z percentile         −1.645   −1.037    −.675    −.385    −.126

Percentage             55       65       75       85       95
Sample observation     .35      .72      .87     1.40     1.56
z percentile           .126     .385     .675    1.037    1.645


Thus the points in the probability plot are (−1.91, −1.645), (−1.25, −1.037),
..., and (1.56, 1.645). Figure 3.31 shows the resulting plot. Although the points
deviate a bit from the 45° line, the predominant impression is that this line fits the
points very well. The plot suggests that the standard normal distribution is a
reasonable probability model for measurement error.


Fig. 3.31 Plots of pairs (observed value, z percentile) for the data of Example 3.33 (the 45° line is superimposed)
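The z percentiles used in Example 3.33 can also be obtained from software rather than from Appendix Table A.3. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `z_percentiles` is our own choice. Values computed this way may differ from table-based values in the third decimal place.

```python
from statistics import NormalDist


def z_percentiles(n):
    """Standard normal percentiles for the percentages 100(i - .5)/n, i = 1..n."""
    std_normal = NormalDist()          # mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


z = z_percentiles(10)
for pct, value in zip(range(5, 100, 10), z):
    print(f"{pct:2d}th percentile: {value: .3f}")
```

Because the standard normal distribution is symmetric about 0, the list returned for any n is antisymmetric: the first and last entries differ only in sign, and likewise for the second and next-to-last.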


An investigator is typically not interested in knowing whether a completely
specified probability distribution, such as the standard normal distribution (normal
with μ = 0 and σ = 1) or the exponential distribution with λ = .1, is a plausible
model for the population distribution from which the sample was selected. Instead,
the investigator will want to know whether some member of a family of probability
distributions specifies a plausible model: the family of normal distributions, the
family of exponential distributions, the family of Weibull distributions, and so
on. The values of the parameters of a distribution are usually not specified at the
outset. If the family of Weibull distributions is under consideration as a model for
lifetime data, the issue is whether there are any values of the parameters α and β for
which the corresponding Weibull distribution gives a good fit to the data.
Fortunately, it is almost always the case that just one probability plot will suffice for
assessing the plausibility of an entire family. If the points in the plot deviate
substantially from a straight-line pattern (any straight line, not necessarily the 45°
line), no member of the family is plausible.

To see why, let’s focus on a plot for checking normality. As mentioned earlier, 
such a plot can be very useful in applied work because many formal statistical 
procedures are appropriate (give accurate inferences) only when the population 
distribution is at least approximately normal. These procedures should generally 
not be used if the normal probability plot shows a very pronounced departure from 
linearity. The key to constructing an omnibus normal probability plot is the 
relationship between standard normal (z) percentiles and those for any other normal 
distribution, which was presented in Sect. 3.3: 


    percentile for a N(μ, σ) distribution = μ + σ · (corresponding z percentile)


Consider first the case μ = 0. Then if each observation is exactly equal to the
corresponding normal percentile for a particular value of σ, the pairs (observation,
σ·[z percentile]) fall on a 45° line, which has slope 1. This implies that the pairs
(observation, z percentile) fall on a line passing through (0, 0) along which
observation = σ · (z percentile); written this way, the line has slope σ rather than 1.
Similarly, the effect of a nonzero value of μ is simply to change the intercept
from 0 to μ, so that observation = μ + σ · (z percentile).


DEFINITION 
A plot of the n pairs

(ith-smallest observation, [100(i − .5)/n]th z percentile)

on a two-dimensional coordinate system is called a normal probability plot.
If the sample observations are in fact drawn from a normal distribution with
mean value μ and standard deviation σ, the points should fall close to the
straight line observation = μ + σ · (z percentile), which has slope σ and
intercept μ. Thus a plot for which the points fall close to some straight line
suggests that the assumption of a normal population distribution is plausible.


Example 3.34 The accompanying sample consisting of n = 20 observations on
dielectric breakdown voltage of a piece of epoxy resin appeared in the article
“Maximum Likelihood Estimation in the 3-Parameter Weibull Distribution”
(IEEE Trans. Dielectrics Electr. Insul., 1996: 43-55). Values of (i − .5)/n for
which z percentiles are needed are (1 − .5)/20 = .025, (2 − .5)/20 = .075, ..., and
.975.


Observation   24.46  25.61  26.25  26.42  26.66  27.15  27.31  27.54  27.74  27.94
z percentile  −1.96  −1.44  −1.15   −.93   −.76   −.60   −.45   −.32   −.19   −.06


Observation   27.98  28.04  28.28  28.49  28.50  28.87  29.11  29.13  29.50  30.88
z percentile    .06    .19    .32    .45    .60    .76    .93   1.15   1.44   1.96


Figure 3.32 shows the resulting normal probability plot. The pattern in the plot is
quite straight, indicating it is plausible that the population distribution of dielectric
breakdown voltage is normal.

Fig. 3.32 Normal probability plot for the dielectric breakdown voltage sample


There is an alternative version of a normal probability plot in which the
z percentile axis is replaced by a nonlinear probability axis. The scaling on this
axis is constructed so that plotted points should again fall close to a line when the
sampled distribution is normal. Figure 3.33 shows such a plot from Matlab,
obtained using the normplot command, for the breakdown voltage data of
Example 3.34. The plot remains essentially the same, and it is just the labeling of
the axis that changes.


Fig. 3.33 Normal probability plot of the breakdown voltage data from Matlab (vertical axis: Probability, on a nonlinear scale)
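The visual impression of straightness for the breakdown voltage data of Example 3.34 can be supplemented with two rough numerical summaries: the least-squares line observation = intercept + slope · (z percentile), whose intercept and slope serve as informal stand-ins for μ and σ, and the correlation between observations and z percentiles. The sketch below is not part of the original example; the function name is ours, and the fitted values are only informal graphical estimates, not the formal estimators of Chap. 5.

```python
from statistics import NormalDist, mean

# Breakdown voltage observations from Example 3.34 (n = 20).
voltages = [24.46, 25.61, 26.25, 26.42, 26.66, 27.15, 27.31, 27.54, 27.74, 27.94,
            27.98, 28.04, 28.28, 28.49, 28.50, 28.87, 29.11, 29.13, 29.50, 30.88]


def normal_plot_summary(data):
    """Fit observation = intercept + slope * z by least squares; also return r."""
    xs = sorted(data)
    n = len(xs)
    zs = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    x_bar, z_bar = mean(xs), mean(zs)
    s_zx = sum((z - z_bar) * (x - x_bar) for z, x in zip(zs, xs))
    s_zz = sum((z - z_bar) ** 2 for z in zs)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_zx / s_zz
    intercept = x_bar - slope * z_bar
    r = s_zx / (s_zz * s_xx) ** 0.5    # correlation between z percentiles and data
    return slope, intercept, r


slope, intercept, r = normal_plot_summary(voltages)
print(f"slope (rough value for sigma)  = {slope:.3f}")
print(f"intercept (rough value for mu) = {intercept:.3f}")
print(f"correlation                    = {r:.4f}")
```

A correlation close to 1 corresponds to the "quite straight" pattern described in the example. Because the z percentiles are symmetric about 0, the fitted intercept coincides with the sample mean.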


3.6.3 Departures from Normality 


A nonnormal population distribution can often be placed in one of the following 
three categories: 
1. It is symmetric and has “lighter tails” than does a normal distribution; that is, the 
density curve declines more rapidly out in the tails than does a normal curve. 
2. It is symmetric and heavy-tailed compared to a normal distribution. 
3. It is skewed; that is, the distribution is not symmetric, but rather tapers off more 
in one direction than the other. 
A uniform distribution is light-tailed, since its density function drops to zero
outside a finite interval. The density function f(x) = 1/[π(1 + x²)], for −∞ < x < ∞,
is one example of a heavy-tailed distribution, since 1/(1 + x²) declines much less
rapidly than does e^(−x²/2). Lognormal and Weibull distributions are among those that
are skewed. When the points in a normal probability plot do not adhere to a straight
line, the pattern will frequently suggest that the population distribution is in a
particular one of these three categories.

Figure 3.34 illustrates typical normal probability plots corresponding to the three
situations above. If the sample was selected from a light-tailed distribution, the largest
and smallest observations are usually not as extreme as would be expected from a 
normal random sample. Visualize a straight line drawn through the middle part of the 
plot; points on the far right tend to be above the line (z percentile > observed value), 
whereas points on the left end of the plot tend to fall below the straight line 
(z percentile < observed value). The result is an S-shaped pattern of the type pictured 
in Fig. 3.34a. For sample observations from a heavy-tailed distribution, the opposite 
effect will occur, and a normal probability plot will have an S shape with the opposite 
orientation, as in Fig. 3.34b. If the underlying distribution is positively skewed 
(a short left tail and a long right tail), the smallest sample observations will be larger 
than expected from a normal sample and so will the largest observations. In this case, 
points on both ends of the plot will fall below a straight line through the middle part, 
yielding a curved pattern, as illustrated in Fig. 3.34c. For example, a sample from a 
lognormal distribution will usually produce such a pattern; a plot of (In(observation), 
z percentile) pairs should then resemble a straight line. 
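These three patterns can be reproduced without any random sampling by using an idealized "sample" whose observations are exactly the percentiles of the distribution in question. The sketch below is an illustration under that assumption, not a simulation study: it draws the line through the two quartile points of the (observation, z percentile) plot and reports whether the points at the 2.5% and 97.5% positions lie above (positive) or below (negative) that line. The function names and the choice of tail positions are ours; the heavy-tailed example is the density 1/[π(1 + x²)] mentioned above, and the skewed example is a lognormal distribution with μ = 0 and σ = 1.

```python
import math
from statistics import NormalDist

Z = NormalDist()


def end_deviations(quantile, p_tail=0.025):
    """Return (left, right): z percentile minus the height of the line through
    the quartile points, evaluated at the lower and upper tail positions."""
    x_lo, x_hi = quantile(0.25), quantile(0.75)
    z_lo, z_hi = Z.inv_cdf(0.25), Z.inv_cdf(0.75)
    slope = (z_hi - z_lo) / (x_hi - x_lo)

    def line(x):
        return z_lo + slope * (x - x_lo)

    left = Z.inv_cdf(p_tail) - line(quantile(p_tail))
    right = Z.inv_cdf(1 - p_tail) - line(quantile(1 - p_tail))
    return left, right


def uniform_quantile(p):        # light-tailed: uniform on [0, 1]
    return p


def cauchy_quantile(p):         # heavy-tailed: density 1/[pi(1 + x^2)]
    return math.tan(math.pi * (p - 0.5))


def lognormal_quantile(p):      # positively skewed: ln(X) is standard normal
    return math.exp(Z.inv_cdf(p))


for name, q in [("light-tailed", uniform_quantile),
                ("heavy-tailed", cauchy_quantile),
                ("skewed", lognormal_quantile)]:
    left, right = end_deviations(q)
    print(f"{name:12s} left end: {left:+.2f}   right end: {right:+.2f}")
```

The signs agree with the description in the text: below on the left and above on the right for the light-tailed case, the reverse for the heavy-tailed case, and below at both ends for the positively skewed case.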

Even when the population distribution is normal, the sample percentiles will not 
coincide exactly with the theoretical percentiles because of sampling variability. 
How much can the points in the probability plot deviate from a straight-line pattern 
before the assumption of population normality is no longer plausible? This is not an 
easy question to answer. Generally speaking, a small sample from a normal 
distribution is more likely to yield a plot with a nonlinear pattern than is a large 
sample. The book Fitting Equations to Data by Cuthbert Daniel and Fred Wood
presents the results of a simulation study in which numerous samples of different
sizes were selected from normal distributions. The authors concluded that there is
typically greater variation in the appearance of the probability plot for sample sizes
smaller than 30, and only for much larger sample sizes does a linear pattern
generally predominate. When a plot is based on a small sample size, only a very
substantial departure from linearity should be taken as conclusive evidence of
nonnormality. A similar comment applies to probability plots for checking the
plausibility of other types of distributions.


Fig. 3.34 Probability plots that suggest a non-normal distribution: (a) a plot consistent with a
light-tailed distribution; (b) a plot consistent with a heavy-tailed distribution; (c) a plot consistent
with a (positively) skewed distribution


3.6.4 Beyond Normality 


Consider a generic family of probability distributions involving two parameters,
θ₁ and θ₂, and let F(x; θ₁, θ₂) denote the corresponding cdf. The family of
normal distributions is one such family, with θ₁ = μ, θ₂ = σ, and F(x; μ, σ) =
Φ[(x − μ)/σ]. Another example is the Weibull family, with θ₁ = α, θ₂ = β, and

$$F(x; \alpha, \beta) = 1 - e^{-(x/\beta)^{\alpha}}$$


Still another family of this type is the gamma family, for which the cdf is an 
integral involving the incomplete gamma function that cannot be expressed in any 
simpler form. 

The parameters θ₁ and θ₂ are said to be location and scale parameters,
respectively, if F(x; θ₁, θ₂) is a function of (x − θ₁)/θ₂. The parameters μ and σ of the normal
family are location and scale parameters, respectively. Changing μ shifts the location
of the bell-shaped density curve to the right or left, and changing σ amounts to
stretching or compressing the measurement scale (the scale on the horizontal axis
when the density function is graphed). Another example is given by the cdf

$$F(x; \theta_1, \theta_2) = 1 - e^{-e^{(x - \theta_1)/\theta_2}}, \qquad -\infty < x < \infty$$

A random variable with this cdf is said to have an extreme value distribution. It is
used in applications involving component lifetime and material strength.

The parameter β of the Weibull distribution is a scale parameter. However, α is
not a location parameter but instead is called a shape parameter. The same is true
for the parameters α and β of the gamma distribution. In the usual form, the density
function for any member of either the gamma or Weibull family is positive for
x > 0 and zero otherwise. A location (or shift) parameter can be introduced as a
third parameter γ (we did this for the Weibull distribution in Sect. 3.5) to shift the
density function so that it is positive if x > γ and zero otherwise.

When the family under consideration has only location and scale parameters, 
the issue of whether any member of the family is a plausible population distribution 
can be addressed via a single, easily constructed probability plot. One first obtains 
the percentiles of the standard distribution, the one with θ₁ = 0 and θ₂ = 1, for
percentages 100(i − .5)/n (i = 1, ..., n). The n (observation, standardized
percentile) pairs give the points in the plot. This is, of course, exactly what we did to obtain
an omnibus normal probability plot. 

Somewhat surprisingly, this methodology can be applied to yield an omnibus
Weibull probability plot. The key result is that if X has a Weibull distribution
with shape parameter α and scale parameter β, then the transformed variable ln(X)
has an extreme value distribution with location parameter θ₁ = ln(β) and scale
parameter θ₂ = 1/α (see Exercise 169). Thus a plot of the


(ln(observation), extreme value standardized percentile)


pairs that shows a strong linear pattern provides support for choosing the Weibull
distribution as a population model.


Example 3.35 The accompanying observations are on lifetime (in hours) of power
apparatus insulation when thermal and electrical stress acceleration were fixed
at particular values (“On the Estimation of Life of Power Apparatus Insulation
Under Combined Electrical and Thermal Stress,” IEEE Trans. Electr. Insul., 1985:
70-78). A Weibull probability plot necessitates first computing the 5th, 15th, ...,
and 95th percentiles of the standard extreme value distribution. The (100p)th
percentile η_p satisfies

$$p = F(\eta_p) = 1 - e^{-e^{\eta_p}}$$

from which η_p = ln(−ln(1 − p)).


Observation        282    501    741    851   1072   1122   1202   1585   1905   2138
ln(observation)   5.64   6.22   6.61   6.75   6.98   7.02   7.09   7.37   7.55   7.67
Percentile       −2.97  −1.82  −1.25   −.84   −.51   −.23    .05    .33    .64   1.10


The pairs (5.64, −2.97), (6.22, −1.82), ..., (7.67, 1.10) are plotted as points in
Fig. 3.35. The straightness of the plot argues strongly for using the Weibull
distribution as a model for insulation life, a conclusion also reached by the author
of the cited article.


Fig. 3.35 A Weibull probability plot of the insulation lifetime data (vertical axis: percentile; horizontal axis: ln(x))
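The plotted pairs in Example 3.35 follow directly from the formula η_p = ln(−ln(1 − p)). The Python sketch below recomputes them. The variable and function names are ours, and the third lifetime is taken to be 741 hours, which is consistent with its tabulated logarithm 6.61.

```python
import math

# Insulation lifetimes (hours) from Example 3.35; third value assumed to be 741.
lifetimes = [282, 501, 741, 851, 1072, 1122, 1202, 1585, 1905, 2138]


def extreme_value_percentile(p):
    """(100p)th percentile of the standard extreme value distribution."""
    return math.log(-math.log(1 - p))


n = len(lifetimes)
pairs = [(math.log(x), extreme_value_percentile((i - 0.5) / n))
         for i, x in enumerate(sorted(lifetimes), start=1)]

for ln_x, eta in pairs:
    print(f"ln(observation) = {ln_x:.2f}   percentile = {eta: .2f}")
```

Plotting the second coordinate of each pair against the first gives the Weibull probability plot described in the example.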


The gamma distribution is an example of a family involving a shape parameter 
for which there is no transformation into a distribution that depends only on 
location and scale parameters. Construction of a probability plot necessitates first 
estimating the shape parameter from sample data (some general methods for doing 
this are described in Chap. 5). 

Sometimes an investigator wishes to know whether the transformed variable X^θ
has a normal distribution for some value of θ (by convention, θ = 0 is identified with
the logarithmic transformation, in which case X has a lognormal distribution). The
book Graphical Methods for Data Analysis by John Chambers et al. discusses this
type of problem as well as other refinements of probability plotting.


3.6.5 Probability Plots in Matlab and R 


Matlab, along with many statistical software packages (including R), has built-in
probability plotting commands that obviate the need for manual calculation of
percentiles from the assumed population distribution. In Matlab, the normplot(x)
command will produce a graph like the one seen in Fig. 3.33, assuming the vector x
contains the observed data. The R command qqnorm(x) creates a similar graph,
except that the axes are transposed (ordered observations on the vertical axis,
theoretical quantiles on the horizontal). Both Matlab and R have a function called probplot
(in R, through an add-on package) that, with appropriate specifications of the inputs,
can create probability plots for distributions besides normal (e.g., Weibull,
exponential, extreme value). Refer to the
help documentation in those languages for more information.


3.6.6 Exercises: Section 3.6 (101-111) 


101. The accompanying normal probability plot was constructed from a sample of 
30 readings on tension for mesh screens behind the surface of video display 
tubes. Does it appear plausible that the tension distribution is normal? 


[Figure: normal probability plot of z percentile (vertical axis) against Tension (horizontal axis, 200 to 350)]


102. A sample of 15 female collegiate golfers was selected and the clubhead 
velocity (km/h) while swinging a driver was determined for each one, 
resulting in the following data (“Hip Rotational Velocities during the Full 
Golf Swing,” J. of Sports Science and Medicine, 2009: 296-299): 


69.0 69.7 72.7 80.3 81.0 
85.0 86.0 86.3 86.7 87.7 
89.3 90.7 91.0 92.5 93.0 


The corresponding z percentiles are 


−1.83 −1.28 −0.97 −0.73 −0.52
−0.34 −0.17 0.0 0.17 0.34
0.52 0.73 0.97 1.28 1.83


Construct a normal probability plot. Is it plausible that the population distri- 
bution is normal? 

103. Construct a normal probability plot for the following sample of observations
on coating thickness for low-viscosity paint (“Achieving a Target Value for a
Manufacturing Process: A Case Study,” J. Qual. Tech., 1992: 22-26). Would
you feel comfortable estimating population mean thickness using a method
that assumed a normal population distribution?


.83 .88 .88 1.04 1.09 1.12 1.29 1.31
1.48 1.49 1.59 1.62 1.65 1.71 1.76 1.83


104. The article “A Probabilistic Model of Fracture in Concrete and Size Effects on
Fracture Toughness” (Mag. Concrete Res., 1996: 311-320) gives arguments
for why fracture toughness in concrete specimens should have a Weibull
distribution and presents several histograms of data that appear well fit by
superimposed Weibull curves. Consider the following sample of size n = 18
observations on toughness for high-strength concrete (consistent with one of
the histograms); values of pᵢ = (i − .5)/18 are also given.


Observation   .47    .58    .65    .69    .72    .74
pᵢ            .0278  .0833  .1389  .1944  .2500  .3056

Observation   .77    .79    .80    .81    .82    .84
pᵢ            .3611  .4167  .4722  .5278  .5833  .6389

Observation   .86    .89    .91    .95    1.01   1.04
pᵢ            .6944  .7500  .8056  .8611  .9167  .9722


Construct a Weibull probability plot and comment.

105. The propagation of fatigue cracks in various aircraft parts has been the subject
of extensive study. The accompanying data consists of propagation lives
(flight hours/10⁴) to reach a given crack size in fastener holes for use in
military aircraft (“Statistical Crack Propagation in Fastener Holes Under
Spectrum Loading,” J. Aircraft, 1983: 1028-1032):


.736 .863 .865 .913 .915 .937 .983 1.007
1.011 1.064 1.109 1.132 1.140 1.153 1.253 1.394


Construct a normal probability plot for this data. Does it appear plausible that
propagation life has a normal distribution? Explain.

106. The article “The Load-Life Relationship for M50 Bearings with Silicon
Nitride Ceramic Balls” (Lubricat. Engrg., 1984: 153-159) reports the
accompanying data on bearing load life (million revs.) for bearings tested at
a 6.45 kN load.


47.1 68.1 68.1 90.8 103.6 106.0 115.0
126.0 146.6 229.0 240.0 240.0 278.0 278.0
289.0 289.0 367.0 385.9 392.0 505.0


(a) Construct a normal probability plot. Is normality plausible?
(b) Construct a Weibull probability plot. Is the Weibull distribution family
plausible?

107. The accompanying data on rainfall (acre-feet) from 26 seeded clouds is taken
from the article “A Bayesian Analysis of a Multiplicative Treatment Effect in
Weather Modification” (Technometrics, 1975: 161-166). Construct a probability
plot that will allow you to assess the plausibility of the lognormal
distribution as a model for the rainfall data, and comment on what you find.


4.1 7.7 17.5 31.4 32.7 40.6 92.4
115.3 118.3 119.0 129.6 198.6 200.7 242.5
255.0 274.7 274.7 302.8 334.1 430.0 489.1
703.4 978.0 1656.0 1697.8 2745.6


108. The accompanying observations are precipitation values during March over a
30-year period in Minneapolis-St. Paul.


.77 1.20 3.00 1.62 2.81 2.48
1.74 .47 3.09 1.31 1.87 .96
.81 1.43 1.51 .32 1.18 1.89
1.20 3.37 2.10 .59 1.35 .90
1.95 2.20 .52 .81 4.75 2.05


(a) Construct and interpret a normal probability plot for this data set.

(b) Calculate the square root of each value and then construct a normal
probability plot based on this transformed data. Does it seem plausible
that the square root of precipitation is normally distributed?

(c) Repeat part (b) after transforming by cube roots.

109. Allowable mechanical properties for structural design of metallic aerospace
vehicles requires an approval method for statistically analyzing empirical test
data. The article “Establishing Mechanical Property Allowables for Metals”
(J. of Testing and Evaluation, 1998: 293-299) used the accompanying data on
tensile ultimate strength (ksi) as a basis for addressing the difficulties in
developing such a method.


122.2 124.2 124.3 125.6 126.3 126.5 126.5 127.2 127.3
127.5 127.9 128.6 128.8 129.0 129.2 129.4 129.6 130.2
130.4 130.8 131.3 131.4 131.4 131.5 131.6 131.6 131.8
131.8 132.3 132.4 132.4 132.5 132.5 132.5 132.5 132.6
132.7 132.9 133.0 133.1 133.1 133.1 133.1 133.2 133.2
133.2 133.3 133.3 133.5 133.5 133.5 133.8 133.9 134.0
134.0 134.0 134.0 134.1 134.2 134.3 134.4 134.4 134.6
134.7 134.7 134.7 134.8 134.8 134.8 134.9 134.9 135.2
135.2 135.2 135.3 135.3 135.4 135.5 135.5 135.6 135.6
135.7 135.8 135.8 135.8 135.8 135.8 135.9 135.9 135.9
135.9 136.0 136.0 136.1 136.2 136.2 136.3 136.4 136.4
136.6 136.8 136.9 136.9 137.0 137.1 137.2 137.6 137.6
137.8 137.8 137.8 137.9 137.9 138.2 138.2 138.3 138.3
138.4 138.4 138.4 138.5 138.5 138.6 138.7 138.7 139.0
139.1 139.5 139.6 139.8 139.8 140.0 140.0 140.7 140.7
140.9 140.9 141.2 141.4 141.5 141.6 142.9 143.4 143.5
143.6 143.8 143.8 143.9 144.1 144.5 144.5 147.7 147.7


Use software to construct a normal probability plot of this data, and comment.
110. Let the ordered sample observations be denoted by y₁, y₂, ..., yₙ (y₁ being
the smallest and yₙ the largest). Our suggested check for normality is to
plot the (yᵢ, Φ⁻¹[(i − .5)/n]) pairs. Suppose we believe that the observations
come from a distribution with mean 0, and let w₁, ..., wₙ be the ordered
absolute values of the observed data. A half-normal plot is a probability plot
of the wᵢ's. More specifically, since P(|Z| ≤ w) = P(−w ≤ Z ≤ w) = 2Φ(w) − 1,
a half-normal plot is a plot of the (wᵢ, Φ⁻¹[(pᵢ + 1)/2]) pairs, where
pᵢ = (i − .5)/n. The virtue of this plot is that small or large outliers in the
original sample will now appear only at the upper end of the plot rather
than at both ends. Construct a half-normal plot for the following sample of
measurement errors, and comment:
−3.78, −1.27, 1.44, −.39, 12.38, −43.40, 1.15, −3.96, −2.34, 30.84.
111. The following failure time observations (1000s of hours) resulted from 
accelerated life testing of 16 integrated circuit chips of a certain type: 


82.8 11.6 359.5 502.5 307.8 179.7 
242.0 26.5 244.8 304.3 379.1 212.6 
229.9 558.9 366.7 203.6 


Use the corresponding percentiles of the exponential distribution with λ = 1 to
construct a probability plot. Then explain why the plot assesses the plausibil- 
ity of the sample having been generated from any exponential distribution. 


3.7 Transformations of a Random Variable 


Often we need to deal with a transformation Y = g(X) of the random variable X. 
Here g(X) could be a simple change of time scale. If X is the time to complete a task 
in minutes, then Y= 60X is the completion time expressed in seconds. How can we 
get the pdf of Y from the pdf of X? Consider first a simple example. 


Example 3.36 The interval X in minutes between calls to a 911 center is
exponentially distributed with mean 2 min, so its pdf is f_X(x) = .5e^(−.5x) for x > 0. In order to get
the pdf of Y = 60X, we first obtain its cdf:

$$F_Y(y) = P(Y \le y) = P(60X \le y) = P(X \le y/60) = F_X(y/60) = \int_0^{y/60} .5e^{-.5x}\,dx = 1 - e^{-y/120}$$

Differentiating this with respect to y gives f_Y(y) = (1/120)e^(−y/120) for y > 0. We see
that the distribution of Y is exponential with mean 120 s (2 min).

There is nothing special here about the mean 2 and the multiplier 60. It should
be clear that if we multiply an exponential random variable with mean μ by a
positive constant c we get another exponential random variable with mean cμ.


Sometimes it isn’t possible to evaluate the cdf in closed form. Could we still find 
the pdf of Y without evaluating the integral? Yes, thanks to the following theorem. 


TRANSFORMATION THEOREM 

Let X have pdf f_X(x) and let Y = g(X), where g is monotonic (either strictly
increasing or strictly decreasing) on the set of all possible values of X, so it
has an inverse function X = h(Y). Assume that h has a derivative h′(y). Then

$$f_Y(y) = f_X(h(y)) \cdot |h'(y)| \tag{3.11}$$


Proof Here is the proof assuming that g is monotonically increasing. The proof for
g monotonically decreasing is similar. First find the cdf of Y:

$$F_Y(y) = P(Y \le y) = P(g(X) \le y) = P(X \le h(y)) = F_X(h(y))$$

The third equality above, wherein g(X) ≤ y is true iff X ≤ g⁻¹(y) = h(y), relies on
g being a monotonically increasing function. Now differentiate the cdf with respect
to y, using the Chain Rule:

$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy} F_X(h(y)) = F_X'(h(y)) \cdot h'(y) = f_X(h(y)) \cdot h'(y)$$

The absolute value on the derivative in Eq. (3.11) is needed only in the other
case where g is decreasing: there F_Y(y) = P(X ≥ h(y)) = 1 − F_X(h(y)), so
differentiating gives −f_X(h(y)) · h′(y), which is nonnegative because h′(y) < 0.
The set of possible values for Y is obtained by applying
g to the set of possible values for X.
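Equation (3.11) translates almost word for word into code. The following Python sketch is an illustration only: the helper name `transformed_pdf` and its arguments are our own, and the caller is responsible for supplying the inverse h and its derivative. It is applied to the setting of Example 3.36 (X exponential with mean 2 min, Y = 60X) and compared with a numerical derivative of the cdf 1 − e^(−y/120) obtained there.

```python
import math


def transformed_pdf(f_x, h, h_prime):
    """Return the pdf of Y = g(X) via Eq. (3.11), where h is the inverse of g."""
    def f_y(y):
        return f_x(h(y)) * abs(h_prime(y))
    return f_y


def f_x(x):
    """pdf of X: exponential with mean 2 (minutes)."""
    return 0.5 * math.exp(-0.5 * x) if x > 0 else 0.0


# Y = 60X, so h(y) = y/60 and h'(y) = 1/60.
f_y = transformed_pdf(f_x, h=lambda y: y / 60, h_prime=lambda y: 1 / 60)


def cdf_y(y):
    """cdf of Y found in Example 3.36."""
    return 1 - math.exp(-y / 120)


y0, eps = 90.0, 1e-4
numeric_slope = (cdf_y(y0 + eps) - cdf_y(y0 - eps)) / (2 * eps)  # central difference
print(f_y(y0), numeric_slope)
```

The two printed numbers should agree to many decimal places, reflecting the fact that the theorem is simply the Chain Rule applied to the cdf.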


Example 3.37 Let’s apply the Transformation Theorem to the situation introduced
in Example 3.36. There Y = g(X) = 60X and X = h(Y) = Y/60.

$$f_Y(y) = f_X(h(y)) \cdot |h'(y)| = .5e^{-.5(y/60)} \cdot \frac{1}{60} = \frac{1}{120}\, e^{-y/120}, \qquad y > 0$$

This matches the pdf of Y derived through the cdf in Example 3.36.


Fig. 3.36 The effect on the pdf if X is uniform on [0, 1] and Y = 2√X: (a) f_X(x); (b) f_Y(y)


Example 3.38 Let X ~ Unif[0, 1], so $f_X(x) = 1$ for 0 ≤ x ≤ 1, and define a new
variable $Y = 2\sqrt{X}$. The function $g(x) = 2\sqrt{x}$ is monotone on [0, 1], with inverse
$x = h(y) = y^2/4$. Apply the Transformation Theorem:

$$f_Y(y) = f_X(h(y)) \cdot |h'(y)| = (1) \cdot \left|\frac{2y}{4}\right| = \frac{y}{2}, \quad 0 \le y \le 2$$

The range 0 ≤ y ≤ 2 comes from the fact that $y = 2\sqrt{x}$ maps [0, 1] to [0, 2].
A graphical representation may help in understanding why the transform $Y = 2\sqrt{X}$
yields $f_Y(y) = y/2$ if X ~ Unif[0, 1]. Figure 3.36a shows the uniform distribution with
[0, 1] partitioned into ten subintervals. In Fig. 3.36b the endpoints of these intervals
are shown after transforming according to $y = 2\sqrt{x}$. The heights of the rectangles
are arranged so each rectangle still has area .1, and therefore the probability in each
interval is preserved. Notice the close fit of the dashed line, which has the equation
$f_Y(y) = y/2$. ■
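Equation (3.11) translates almost directly into code. The R sketch below wraps it in a helper and applies it to the transformations of Examples 3.37 and 3.38. The function name `transform_pdf` and its argument names are hypothetical, introduced only for this illustration, and the caller is assumed to supply the inverse h and its derivative by hand.

```r
# Sketch of Eq. (3.11): f_Y(y) = f_X(h(y)) * |h'(y)|.
# transform_pdf is a hypothetical helper, not a built-in R function.
transform_pdf <- function(f_X, h, h_prime) {
  function(y) f_X(h(y)) * abs(h_prime(y))
}

# Example 3.37: X exponential with mean 2, Y = 60X, h(y) = y/60
f_Y_337 <- transform_pdf(
  f_X = function(x) dexp(x, rate = 1 / 2),
  h = function(y) y / 60,
  h_prime = function(y) rep(1 / 60, length(y))
)

# Example 3.38: X ~ Unif[0, 1], Y = 2*sqrt(X), h(y) = y^2/4
f_Y_338 <- transform_pdf(
  f_X = function(x) dunif(x, min = 0, max = 1),
  h = function(y) y^2 / 4,
  h_prime = function(y) y / 2
)

y1 <- c(30, 120, 300)
y2 <- c(0.5, 1, 1.5)
check_337 <- max(abs(f_Y_337(y1) - dexp(y1, rate = 1 / 120)))
check_338 <- max(abs(f_Y_338(y2) - y2 / 2))
print(c(check_337, check_338))
```

Both differences should be zero up to rounding, in agreement with the pdfs derived by hand.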


Example 3.39 The variation in a certain electrical current source X (in milliamps)
can be modeled by the pdf

$$f_X(x) = \begin{cases} 1.25 - .25x & 2 \le x \le 4 \\ 0 & \text{otherwise} \end{cases}$$

If this current passes through a 220-Ω resistor, the resulting power
Y (in microwatts) is given by the expression $Y = 220X^2$. The function $y = g(x) = 220x^2$
is monotonically increasing on the range of X, the interval [2, 4], and has
inverse function $x = h(y) = g^{-1}(y) = \sqrt{y/220}$. (Notice that g(x) is a parabola and
thus not monotone on the entire real number line, but for the purposes of the
theorem g(x) only needs to be monotone on the range of the rv X.) Apply Eq. (3.11):

$$f_Y(y) = f_X(h(y)) \cdot |h'(y)| = f_X\left(\sqrt{y/220}\right) \cdot \left|\frac{d}{dy}\sqrt{y/220}\right| = \left(1.25 - .25\sqrt{y/220}\right) \cdot \frac{1}{2\sqrt{220y}} = \frac{5}{8\sqrt{220y}} - \frac{1}{1760}$$

The set of possible Y-values is determined by substituting x = 2 and x = 4 into
$g(x) = 220x^2$; the resulting range for Y is [880, 3520]. Therefore, the pdf of
$Y = 220X^2$ is

$$f_Y(y) = \begin{cases} \dfrac{5}{8\sqrt{220y}} - \dfrac{1}{1760} & 880 \le y \le 3520 \\ 0 & \text{otherwise} \end{cases}$$

The pdfs of X and Y appear in Fig. 3.37. ■

Fig. 3.37 pdfs from Example 3.39: (a) pdf of X; (b) pdf of Y
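A derived pdf such as this one is easy to sanity-check numerically. The R sketch below verifies that the pdf of Y integrates to 1 over [880, 3520] and that P(Y ≤ 1980) equals P(X ≤ 3), since 220 · 3² = 1980. The helper names `f_Y` and `F_X` are illustrative; `F_X` is the cdf obtained by integrating 1.25 − .25x from 2 to x.

```r
# pdf of Y = 220 X^2 derived in Example 3.39 (valid on [880, 3520])
f_Y <- function(y) 5 / (8 * sqrt(220 * y)) - 1 / 1760

# cdf of X on [2, 4], from integrating f_X(x) = 1.25 - .25x
F_X <- function(x) -0.125 * x^2 + 1.25 * x - 2

# Total probability for Y
total <- integrate(f_Y, lower = 880, upper = 3520)$value

# P(Y <= 1980) versus P(X <= 3)
p_Y <- integrate(f_Y, lower = 880, upper = 1980)$value
p_X <- F_X(3)
print(c(total, p_Y, p_X))
```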


The Transformation Theorem requires a monotonic transformation, but there are 
important applications in which the transformation is not monotone. Nevertheless, 
it may be possible to use the theorem anyway with a little trickery. 


Example 3.40 In this example, we start with a standard normal random variable Z,
and we transform to $Y = Z^2$. Of course, this is not monotonic over the interval for Z,
(−∞, ∞). However, consider the transformation U = |Z|. Because Z has a symmetric
distribution, the pdf of U is $f_U(u) = f_Z(u) + f_Z(-u) = 2f_Z(u)$. (Don’t despair if this
is not intuitively clear, because we’ll verify it shortly. For the time being, assume it
to be true.) Then $Y = Z^2 = |Z|^2 = U^2$, and the transformation in terms of U is
monotonic because its set of possible values is [0, ∞). Thus we can use the
Transformation Theorem with $h(y) = y^{1/2}$:

$$f_Y(y) = f_U(h(y)) \cdot |h'(y)| = 2f_Z(h(y)) \cdot |h'(y)| = \frac{2}{\sqrt{2\pi}} e^{-.5(y^{1/2})^2} \cdot \frac{1}{2} y^{-1/2} = \frac{1}{\sqrt{2\pi y}} e^{-y/2}, \quad y > 0$$

This distribution is known as the chi-squared distribution with one degree
of freedom. Chi-squared distributions arise frequently in statistical inference
procedures, such as those in Chap. 5.

You were asked to believe intuitively that $f_U(u) = 2f_Z(u)$. Here is a little
derivation that works as long as $f_Z(z)$ is an even function [i.e., $f_Z(-z) = f_Z(z)$].
If u > 0,

$$F_U(u) = P(U \le u) = P(|Z| \le u) = P(-u \le Z \le u) = 2P(0 \le Z \le u) = 2[F_Z(u) - F_Z(0)]$$

Differentiating this with respect to u gives $f_U(u) = 2f_Z(u)$. ■
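Both results of this example can be compared against R's built-in functions. The sketch below checks the derived pdf of Y against `dchisq` with one degree of freedom, and checks $f_U(u) = 2f_Z(u)$ by numerically differentiating $F_U$. The step size `eps` and the evaluation points are arbitrary illustrative choices.

```r
# pdf of Y = Z^2 derived with the Transformation Theorem
f_Y <- function(y) exp(-y / 2) / sqrt(2 * pi * y)
y <- c(0.1, 0.5, 1, 2, 5)
diff_chisq <- max(abs(f_Y(y) - dchisq(y, df = 1)))

# cdf of U = |Z|: F_U(u) = 2[F_Z(u) - F_Z(0)]
F_U <- function(u) 2 * (pnorm(u) - pnorm(0))

# Central-difference derivative of F_U compared with 2 * f_Z(u)
u <- c(0.5, 1, 2)
eps <- 1e-5
f_U_numeric <- (F_U(u + eps) - F_U(u - eps)) / (2 * eps)
diff_fold <- max(abs(f_U_numeric - 2 * dnorm(u)))

print(c(diff_chisq, diff_fold))
```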


Example 3.41 Sometimes the Transformation Theorem cannot be used at all,
and you need to use the cdf. Let $f_X(x) = (x + 1)/8$, −1 ≤ x ≤ 3, and $Y = X^2$. The
transformation is not monotonic on (−1, 3) and $f_X(x)$ is not an even function.
Possible values of Y are {y: 0 ≤ y ≤ 9}. Considering first 0 ≤ y ≤ 1,

$$F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = \int_{-\sqrt{y}}^{\sqrt{y}} \frac{u + 1}{8}\,du = \frac{\sqrt{y}}{4}$$

Then, on the other subinterval, 1 < y ≤ 9,

$$F_Y(y) = P(Y \le y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = P(-1 \le X \le \sqrt{y}) = \int_{-1}^{\sqrt{y}} \frac{u + 1}{8}\,du = \frac{1 + y + 2\sqrt{y}}{16}$$

Differentiating, we get

$$f_Y(y) = \begin{cases} \dfrac{1}{8\sqrt{y}} & 0 < y < 1 \\[2mm] \dfrac{y + \sqrt{y}}{16y} & 1 < y < 9 \\[2mm] 0 & \text{otherwise} \end{cases}$$

Figure 3.38 shows the pdfs of both X and Y. ■

Fig. 3.38 pdfs from Example 3.41: (a) pdf of X; (b) pdf of Y
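The piecewise cdf of Example 3.41 can be cross-checked against probabilities computed directly from the distribution of X. In the R sketch below, `F_X` is the cdf of X obtained by integrating (u + 1)/8 from −1, and the function names are illustrative rather than taken from the text.

```r
# cdf of X on [-1, 3], from integrating (u + 1)/8 starting at -1
F_X <- function(x) (x + 1)^2 / 16

# P(X^2 <= y) computed directly; the lower limit cannot go below -1
F_Y_direct <- function(y) F_X(sqrt(y)) - F_X(pmax(-sqrt(y), -1))

# Piecewise formula derived in the example
F_Y_formula <- function(y) {
  ifelse(y <= 1, sqrt(y) / 4, (1 + y + 2 * sqrt(y)) / 16)
}

y <- c(0.25, 1, 4, 9)
max_diff <- max(abs(F_Y_direct(y) - F_Y_formula(y)))
print(F_Y_formula(y))
print(max_diff)
```

The two pieces of the formula meet at y = 1, where both give 1/4, so the cdf is continuous even though the pdf jumps there.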


3.7.1. Exercises: Section 3.7 (112-128)


112. Relative to the winning time, the time X of another runner in a ten kilometer
     race has pdf $f_X(x) = 2/x^3$, x > 1. The reciprocal Y = 1/X represents the ratio of
     the time for the winner divided by the time of the other runner. Find the pdf of
     Y. Explain why Y also represents the speed of the other runner relative to the
     winner.

113. Let X be the fuel efficiency in miles per gallon of an extremely inefficient
     vehicle (a military tank, perhaps?), and suppose X has the pdf $f_X(x) = 2x$,
     0 < x < 1. Determine the pdf of Y = 1/X, which is fuel efficiency in gallons per
     mile. [Note: The distribution of Y is a special case of the Pareto distribution
     (see Exercise 10).]

114. Let X have the pdf $f_X(x) = 2/x^3$, x > 1. Find the pdf of $Y = \sqrt{X}$.

115. Let X have an exponential distribution with mean 2, so $f_X(x) = \tfrac{1}{2} e^{-x/2}$, x > 0.
     Find the pdf of $Y = \sqrt{X}$. [Note: Suppose you choose a point in two dimensions
     randomly, with the horizontal and vertical coordinates chosen independently
     from the standard normal distribution. Then X has the distribution of the
     squared distance from the origin and Y has the distribution of the distance
     from the origin. Y has a Rayleigh distribution (see Exercise 4).]

116. If X is distributed as N($\mu$, $\sigma$), find the pdf of $Y = e^X$. Verify that the distribution
     of Y matches the lognormal pdf provided in Sect. 3.5.

117. If the side of a square X is random with the pdf $f_X(x) = x/8$, 0 < x < 4, and Y is
     the area of the square, find the pdf of Y.

118. Let X ~ Unif[0, 1]. Find the pdf of Y = −ln(X).

119. Let X ~ Unif[0, 1]. Find the pdf of $Y = \tan[\pi(X - .5)]$. [Note: The random variable
     Y has the Cauchy distribution, named after the famous mathematician.]

120. If X ~ Unif[0, 1], find a linear transformation Y = cX + d such that Y is
     uniformly distributed on [A, B], where A and B are any two numbers such
     that A < B. Is there any other solution? Explain.

121. If X has the pdf $f_X(x) = x/8$, 0 < x < 4, find a transformation Y = g(X) such
     that Y ~ Unif[0, 1]. [Hint: The target is to achieve $f_Y(y) = 1$ for 0 ≤ y ≤ 1.
     The Transformation Theorem will allow you to find h(y), from which g(x) can
     be obtained.]

122. If a measurement error X is uniformly distributed on [−1, 1], find the pdf of
     Y = |X|, which is the magnitude of the measurement error.

123. If X ~ Unif[−1, 1], find the pdf of $Y = X^2$.

124. Ann is expected at 7:00 pm after an all-day drive. She may be as much as 1 h
     early or as much as 3 h late. Assuming that her arrival time X is uniformly
     distributed over that interval, find the pdf of |X − 7|, the unsigned difference
     between her actual and predicted arrival times.

125. If X ~ Unif[−1, 3], find the pdf of $Y = X^2$.

126. If a measurement error X is distributed as N(0, 1), find the pdf of |X|, which is
     the magnitude of the measurement error.

127. A circular target has radius 1 foot. Assume that you hit the target (we shall
     ignore misses) and that the probability of hitting any region of the target is
     proportional to the region’s area. If you hit the target at a distance Y from the
     center, then let $X = \pi Y^2$ be the corresponding area. Show that
     (a) X is uniformly distributed on [0, $\pi$]. [Hint: Show that $F_X(x) = P(X \le x) = x/\pi$.]
     (b) Y has pdf $f_Y(y) = 2y$, 0 < y < 1.

128. In Exercise 127 suppose instead that Y is uniformly distributed on [0, 1]. Find
     the pdf of $X = \pi Y^2$. Geometrically speaking, why should X have a pdf that is
     unbounded near 0?


3.8 Simulation of Continuous Random Variables 


In Sects. 1.6 and 2.8, we discussed the need for simulation of random events 
and discrete random variables in situations where an “analytic” solution is very 
difficult or simply not possible. This section presents methods for simulating 
continuous random variables, including some of the built-in simulation tools of 
Matlab and R. 


3.8.1. The Inverse CDF Method 


Section 2.8 introduced the inverse cdf method for simulating discrete random 
variables. The basic idea was this: generate a Unif[0, 1) random number and 
align it with the cdf of the random variable X we want to simulate. Then, determine 
which X value corresponds to that cdf value. We now extend this methodology to
the simulation of values from a continuous distribution; the heart of the algorithm
relies on the following theorem, often called the probability integral transform.


THEOREM
Consider any continuous distribution with pdf f and cdf F. Let U ~ Unif[0, 1),
and define a random variable X by

$$X = F^{-1}(U) \qquad (3.12)$$

Then the pdf of X is f.


Before proving this theorem, let’s consider its practical usage: Suppose we want
to simulate a continuous rv whose pdf is f(x), i.e., obtain successive values of
X having pdf f(x). If we can compute the corresponding cdf F(x) and apply its
inverse $F^{-1}$ to standard uniform variates $u_1, \ldots, u_n$, the theorem states that the
resulting variates $x_1 = F^{-1}(u_1), \ldots, x_n = F^{-1}(u_n)$ will follow the desired distribution
f. (We’ll discuss the practical difficulties of implementing this method a little
later.) A graphical description of the algorithm appears in Fig. 3.39.


Proof Apply the Transformation Theorem (Sect. 3.7) with $f_U(u) = 1$ for 0 ≤ u < 1,
$X = g(U) = F^{-1}(U)$, and thus $U = h(X) = g^{-1}(X) = F(X)$. The pdf of the
transformed variable X is

$$f_X(x) = f_U(h(x)) \cdot |h'(x)| = f_U(F(x)) \cdot |F'(x)| = 1 \cdot |f(x)| = f(x)$$

In the last step, the absolute values may be removed because a pdf is always
nonnegative. ■


The following box explains the implementation of the inverse cdf method 
justified by the preceding theorem. 


Fig. 3.39 The inverse cdf method, illustrated


INVERSE CDF METHOD

It is desired to simulate n values from a distribution with pdf f(x). Let F(x) be the
corresponding cdf. Repeat n times:

1. Use a random-number generator (RNG) to produce a value, u, from [0, 1).
2. Assign $x = F^{-1}(u)$.

The resulting values $x_1, \ldots, x_n$ form a simulation of a random variable with
the original pdf, f(x).
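The two boxed steps can be sketched as a small R function. The name `sim_inverse_cdf` and the illustrative cdf F(x) = x² on [0, 1] (so that $F^{-1}(u) = \sqrt{u}$ and f(x) = 2x) are assumptions made for this sketch only. The inverse cdf must be supplied by the user as a vectorized function.

```r
# sim_inverse_cdf is a hypothetical helper illustrating the boxed algorithm.
# F_inv must be a vectorized function implementing the inverse cdf.
sim_inverse_cdf <- function(n, F_inv) {
  u <- runif(n)   # step 1: n values from the standard uniform RNG
  F_inv(u)        # step 2: x = F^{-1}(u), applied elementwise
}

# Illustration: F(x) = x^2 on [0, 1], so F^{-1}(u) = sqrt(u)
set.seed(1)
x <- sim_inverse_cdf(10000, F_inv = sqrt)
print(mean(x))    # the theoretical mean of f(x) = 2x on [0, 1] is 2/3
```

Because `runif(n)` produces all n uniform variates at once, no explicit loop is needed.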


Example 3.42 Consider the electrical current distribution model of Example 3.11,
where the pdf of X is given by f(x) = 1.25 − .25x for 2 ≤ x ≤ 4. Suppose a simulation
of X is required as part of some larger system analysis. To implement the above
method, the inverse of the cdf of X is required. First, compute the cdf:

$$F(x) = P(X \le x) = \int_2^x f(y)\,dy = \int_2^x (1.25 - .25y)\,dy = -0.125x^2 + 1.25x - 2, \quad 2 \le x \le 4$$

To find the probability integral transform Eq. (3.12), set u = F(x) and solve for x:

$$u = F(x) = -0.125x^2 + 1.25x - 2 \quad \Rightarrow \quad x = F^{-1}(u) = 5 - \sqrt{9 - 8u}$$

The equation above has been solved using the quadratic formula; care must be
taken to select the solution whose values lie in the interval [2, 4] (the other solution,
$x = 5 + \sqrt{9 - 8u}$, does not have that feature). Beginning with the usual Unif[0, 1)
RNG, the algorithm for simulating X is the following: given a value u from the RNG,
assign $x = 5 - \sqrt{9 - 8u}$. Repeating this algorithm n times gives n simulated values of
X. Programs in Matlab and R that implement this algorithm appear in Fig. 3.40; both
return a vector, x, containing n = 10,000 simulated values of the specified distribution.


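(a) Matlab

    x=zeros(10000,1);          % preallocate the output vector
    for i=1:10000
        u=rand;                % standard uniform variate
        x(i)=5-sqrt(9-8*u);    % inverse cdf transform
    end

(b) R

    x <- NULL
    for (i in 1:10000) {
        u <- runif(1)            # standard uniform variate
        x[i] <- 5-sqrt(9-8*u)    # inverse cdf transform
    }


Fig. 3.40 Simulation code for Example 3.42: (a) Matlab; (b) R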


As discussed in Chap. 1, both of these programs can be accelerated by 
“vectorizing” the operations rather than using a for loop. In fact, a single line of 
code in either language can produce the desired result: 


in Matlab: x=5-sqrt(9-8*rand(10000,1))
in R: x<-5-sqrt(9-8*runif(10000))


The pdf of the rv X and a histogram of simulation results from R appear in
Fig. 3.41. ■

Fig. 3.41 (a) Theoretical pdf and (b) R simulation results for Example 3.42


Example 3.43 The lifetime of a certain type of drill bit has an exponential
distribution with mean 100 h. An analysis of a large manufacturing process that
uses these drill bits requires the simulation of this lifetime distribution, which can
be achieved through the inverse cdf method. From Sect. 3.4, the cdf of this
exponential distribution is $F(x) = 1 - e^{-x/100}$, and so the inverse cdf is
$x = F^{-1}(u) = -100\ln(1 - u)$. Applying this function to Unif[0, 1) random numbers
will generate the desired simulation. (Don’t let the negative sign at the front worry
you: since 0 ≤ u < 1, 1 − u lies between 0 and 1, and so its logarithm is negative and
the resulting value of x is actually positive.)

As a check, the code x=-100*log(1-rand(10000,1)) was submitted
to Matlab and the resulting sample mean and sd were obtained using mean(x) and
std(x). Exponentially distributed rvs have standard deviation equal to the mean,
so the theoretical answers are $\mu = 100$ and $\sigma = 100$. The Matlab simulation yielded
$\bar{x} = 99.3724$ and s = 100.8908, both of which are reasonably close to 100 and
validate the inverse cdf formula.

In general, an exponential distribution with mean $\mu$ (equivalently, parameter
$\lambda = 1/\mu$) can be simulated using the transform $x = -\mu\ln(1 - u)$. ■
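The general transform $x = -\mu\ln(1 - u)$ is exactly the exponential quantile function, which R exposes as `qexp`. The short sketch below compares the two on a grid of u values; the grid and the variable names are illustrative choices.

```r
# Compare the hand-derived inverse cdf with R's exponential quantile function
mu <- 100                           # mean lifetime, in hours
u <- seq(0.05, 0.95, by = 0.05)     # illustrative grid of uniform values

x_formula <- -mu * log(1 - u)       # x = -mu * ln(1 - u)
x_builtin <- qexp(u, rate = 1 / mu) # built-in inverse cdf, rate = 1/mu

max_diff <- max(abs(x_formula - x_builtin))
print(max_diff)
```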


The preceding two examples illustrated the inverse cdf method for fairly 
simple density functions: a linear polynomial and an exponential function. In 
practice, the algebraic complexity of f(x) can often be a barrier to implementing 
this simulation technique. After all, the algorithm requires that we can (1) obtain 
the cdf F(x) in closed form and (2) find the inverse function of F in closed form. 
Consider, for example, attempting to simulate values from the N(0, 1) distribution:
its cdf is the function denoted $\Phi(z)$ and given by the integral expression
$\Phi(z) = (1/\sqrt{2\pi}) \int_{-\infty}^{z} e^{-y^2/2}\,dy$. There is no closed-form expression for this integral,
let alone a method to solve $u = \Phi(z)$ for z and implement Eq. (3.12). (As a reminder,
the lack of a closed-form expression for $\Phi(z)$ is the reason that software or tables
are always required for calculations involving normal probabilities.) Thankfully,
most software packages, including Matlab and R, have built-in tools to simulate 
normally distributed variates (using a very clever algorithm called the Box-Muller 
method; see Sect. 4.6). We’ll discuss built-in simulation tools at the end of this 
section. 

As the next example illustrates, even when F(x) can be determined in closed 
form we cannot necessarily implement the inverse cdf method, because F(x) cannot 
always be inverted. This difficulty surfaces in practice when attempting to simulate 
values from a gamma distribution. 


Example 3.44 The measurement error X (in mV) of a particular volt-meter has
the following distribution: $f(x) = (4 - x^2)/9$ for −1 ≤ x ≤ 2 (and f(x) = 0 otherwise).
To use the inverse cdf method to simulate X, begin by calculating its cdf:

$$F(x) = \int_{-1}^{x} \frac{4 - y^2}{9}\,dy = \frac{-x^3 + 12x + 11}{27}, \quad -1 \le x \le 2$$

To implement step 2 of the inverse cdf method requires solving F(x) = u for x;
since F(x) is a cubic polynomial, this is not a simple task. Advanced computer
algebra systems can solve this equation, though the general solution is unwieldy
(and such a solution doesn’t exist at all for 5th-degree and higher polynomials).
Readers familiar with numerical analysis methods may recognize that, for any
specified numerical value of u, a root-finding algorithm (such as Newton-Raphson)
can be implemented to approximate the solution x. This latter method, however, is
computationally intensive, especially if it’s desirable to generate 10,000 or more
simulated values of x. ■
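To make the root-finding idea concrete, here is a minimal R sketch that inverts the cubic cdf numerically. It uses base R's `uniroot` (a bracketing root finder, swapped in here in place of Newton-Raphson), and the helper name `F_inv_numeric` is hypothetical. Each call solves one equation F(x) = u, which is why this approach becomes expensive for large simulations.

```r
# cdf from Example 3.44 on [-1, 2]
F <- function(x) (-x^3 + 12 * x + 11) / 27

# Numerical inverse: solve F(x) = u on [-1, 2] with uniroot (base R).
# F is increasing on this interval, so the root is unique for 0 < u < 1.
F_inv_numeric <- function(u) {
  uniroot(function(x) F(x) - u, lower = -1, upper = 2, tol = 1e-10)$root
}

u_vals <- c(0.1, 0.5, 0.9)
x_vals <- sapply(u_vals, F_inv_numeric)
print(x_vals)
print(F(x_vals))   # should reproduce u_vals
```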


The preceding example suggests that the inverse cdf method is insufficient for 
simulating all continuous distributions in practice. We next consider an alternative 
algorithm that, while less efficient, has a broader scope. 


3.8.2 The Accept-Reject Method 


When the inverse cdf method of simulation cannot be implemented, the accept-reject
method provides an alternative. The downside of the accept-reject method,
as will be explained below, is that only some of the random numbers generated
by software will be used (“accepted”), while others will be “rejected.” As a result,
one needs to create more (sometimes many more) random variates than the
desired number of simulated values.

Suppose we wish to simulate a random variable X, whose pdf is f(x). The key to
the accept-reject method is to begin with a different pdf, call it g(x), that satisfies
two properties: (1) we can already simulate values from g(x), so g is either
algebraically simple or else built into our software package; (2) the set of possible
x-values for the distribution specified by g(x) equals (or exceeds) that of f(x). For
example, to simulate the distribution in Example 3.44, whose range of x-values is
[−1, 2], one might select for g(x) the uniform distribution on [−1, 2], i.e., g(x) = 1/3
for −1 ≤ x ≤ 2. If X takes on values across [0, ∞), then an exponential pdf would be
a logical choice for g(x).


ACCEPT-REJECT METHOD

It is desired to simulate n values from a distribution with pdf f(x). Let g(x) be
some other pdf such that the ratio f/g is bounded, i.e., there exists a constant
c such that f(x)/g(x) ≤ c for all x. (The constant c is sometimes called the
majorization constant.) Proceed as follows:

1. Generate a variate, y, from the distribution g. This value y is called a
   candidate.
2. Generate a standard uniform variate, u.
3. If $u \cdot c \cdot g(y) \le f(y)$, then assign x = y (i.e., “accept” the candidate).
   Otherwise, discard (“reject”) y and return to step 1.

These steps are repeated until n candidate values have been accepted.
The resulting accepted values $x_1, \ldots, x_n$ form a simulation of a random
variable with the original pdf, f(x).
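The boxed steps can be sketched as a general-purpose R function. Everything named here (`accept_reject`, its arguments, and the illustrative target f(x) = 2x on [0, 1] with a Unif[0, 1] candidate and c = 2) is an assumption made for the sketch, not something defined in the text. The function also counts candidates so that the cost of the method can be observed.

```r
# accept_reject is a hypothetical helper, not a built-in R function.
# f: target pdf; g: candidate pdf; r_g: function(k) returning k draws from g;
# c_maj: majorization constant with f(x)/g(x) <= c_maj for all x.
accept_reject <- function(n, f, g, r_g, c_maj) {
  x <- numeric(n)
  accepted <- 0
  candidates <- 0
  while (accepted < n) {
    y <- r_g(1)                       # step 1: candidate from g
    u <- runif(1)                     # step 2: standard uniform variate
    candidates <- candidates + 1
    if (u * c_maj * g(y) <= f(y)) {   # step 3: accept, otherwise try again
      accepted <- accepted + 1
      x[accepted] <- y
    }
  }
  list(x = x, candidates = candidates)
}

# Illustration: target f(x) = 2x on [0, 1], candidate Unif[0, 1], c = 2
set.seed(1)
out <- accept_reject(2000,
                     f = function(x) 2 * x,
                     g = function(x) dunif(x, min = 0, max = 1),
                     r_g = function(k) runif(k),
                     c_maj = 2)
print(mean(out$x))       # the theoretical mean is 2/3
print(out$candidates)    # total number of candidates generated
```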


A proof that the method works, i.e., that the resulting values really do simulate
the target distribution f(x), requires material from Chap. 4 (see Exercise 22 at the
end of Sect. 4.1).


Fig. 3.42 The accept-reject method (horizontal axis: the candidate y)


Figure 3.42 illustrates the key step in this algorithm. A candidate y has been
generated on the common interval of the pdfs f and g. Given y, the left-hand side
of the inequality in step 3, $u \cdot c \cdot g(y)$, is uniformly distributed on the interval
from 0 to $c \cdot g(y)$ (since u itself is standard uniform). If $u \cdot c \cdot g(y)$ falls between
0 and f(y), i.e., lies underneath the target pdf f, then that y-value is accepted
as coming from f; otherwise, y is rejected.

As a corollary to proving the validity of the accept-reject method, it can also
be shown that the probability any particular candidate y is accepted equals 1/c.
(The value of c must always be at least 1; can you see why?) Since successive
candidates are independent, it follows that the number of candidates required to
generate a single acceptable value has a geometric distribution, and the expected
number of candidates to generate one x from f(x) is 1/(1/c) = c. By extension,
the expected number of candidates required to generate our simulation sample
of size n is cn. Consequently, the majorization constant c should always be
made as small as possible, i.e., we should find the smallest value c such that
f(x)/g(x) ≤ c for all x under consideration.


Example 3.45 (Example 3.44 continued) In order to simulate 10,000 values
from the pdf $f(x) = (4 - x^2)/9$, −1 ≤ x ≤ 2, we will rely on our ability to generate
variates from g(x) = 1/3 on −1 ≤ x ≤ 2, the uniform pdf. To implement the accept-reject
method, we must determine the majorization constant, c, by looking at the
ratio f/g:

$$\frac{f(x)}{g(x)} = \frac{(4 - x^2)/9}{1/3} = \frac{4 - x^2}{3} \le \frac{4}{3} \quad \text{for } -1 \le x \le 2$$

The expression $4 - x^2$ represents a downward-facing parabola with vertex
at x = 0, so it is clearly maximized at 0. We conclude that c = 4/3 is the smallest
possible majorization constant, and that is what we shall use. To create the
desired simulation, the following steps are repeated until 10,000 values are
accepted in step 3.

1. Generate y from the uniform distribution on [−1, 2].
2. Generate u from the standard uniform RNG.
3. If $u \cdot \frac{4}{3} \cdot \frac{1}{3} \le \frac{4 - y^2}{9}$, assign x = y; otherwise, discard y and return to step 1.

Figure 3.43 shows the preceding algorithm implemented in Matlab and R.
Both programs result in a vector of 10,000 simulated values from the pdf f(x).
Figure 3.44 shows f(x) alongside the simulated values from Matlab. Since c = 4/3,
it’s expected to require (4/3)(10,000) ≈ 13,333 iterations of the while loop to create
the desired simulation size; by adding a counter to the program, one run of the
Matlab code was found to use 13,303 candidates.

(a) Matlab

    x=zeros(10000,1);
    i=0;
    while i<10000
        y=-1+3*rand;                     % candidate from Unif[-1, 2]
        u=rand;                          % standard uniform variate
        if u*4/3*1/3<=(4-y^2)/9          % accept if u*c*g(y) <= f(y)
            i=i+1;
            x(i)=y;
        end
    end

(b) R

    x <- NULL
    i <- 0
    while (i<10000) {
        y <- -1+3*runif(1)               # candidate from Unif[-1, 2]
        u <- runif(1)                    # standard uniform variate
        if (u*4/3*1/3<=(4-y^2)/9) {      # accept if u*c*g(y) <= f(y)
            i <- i+1
            x[i] <- y
        }
    }


Fig. 3.43 Simulation code for Example 3.45: (a) Matlab; (b) R

Fig. 3.44 pdf and histogram of simulated values for Example 3.45

You may have noticed that step 3 may be simplified: the inequality
$u \le (4 - y^2)/4$ would be equivalent to the one presented. In fact, it is very common
to see this final step of the accept-reject algorithm written as “accept y iff
$u \le f(y)/[c \cdot g(y)]$.” ■
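The simplified acceptance rule u ≤ (4 − y²)/4 also lends itself to a vectorized version of the simulation. The R sketch below generates a fixed batch of candidates at once and keeps those that pass the test. This is a variation on the looped programs of Fig. 3.43, not a replacement for them: the number of accepted values is random rather than fixed in advance, and `n_cand` is an arbitrary illustrative batch size.

```r
# Vectorized accept-reject for f(x) = (4 - x^2)/9 on [-1, 2]
# with a Unif[-1, 2] candidate and c = 4/3.
set.seed(1)
n_cand <- 20000                          # number of candidates (illustrative)
y <- runif(n_cand, min = -1, max = 2)    # candidates from g
u <- runif(n_cand)                       # standard uniform variates
keep <- u <= (4 - y^2) / 4               # accept iff u <= f(y) / [c * g(y)]
x <- y[keep]                             # accepted values

accept_rate <- mean(keep)                # should be near 1/c = 0.75
print(accept_rate)
print(length(x))
print(mean(x))                           # the theoretical mean of f is 0.25
```

The observed acceptance rate estimates 1/c, which is the acceptance probability stated for any single candidate.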


For more information on the accept—reject method and selection of a sensible 
“candidate” distribution g(x) consult the text Simulation by Ross listed in the 
references. 


3.8.3 Built-In Simulation Packages for Matlab and R 


As was true for the most common discrete distributions, many software packages 
have built-in tools for simulating values from the continuous models named in this 
chapter. Table 3.3 summarizes the relevant functions in Matlab and R for the 
uniform, normal, gamma, and exponential distributions; the variable n refers to 
the desired number of simulated values of the distribution. Both packages include 
similar commands for the Weibull, lognormal, and beta distributions. 

As was the case with the cdf commands discussed in Sect. 3.4, Matlab and R
parameterize the gamma and exponential distributions differently: Matlab always
requires the “scale” parameter $\beta = 1/\lambda$, while R takes in the “rate” parameter
$\lambda = 1/\beta$. (In the gamma simulation command, this can be overridden by naming
the final argument scale, as in rgamma(n,α,scale=β).) In R, the command
rnorm(n) will generate standard normal variates (i.e., with $\mu = 0$ and $\sigma = 1$), but
those arguments are required in Matlab. Similarly, R will generate standard uniform
variates (A = 0 and B = 1), the basis for many of our simulation methods, with the
command runif(n). Matlab’s corresponding syntax is rand([n,1]); if you
type rand(100) instead of rand([100,1]), you will receive a 100-by-100
matrix of Unif[0, 1) values.


Table 3.3 Functions to simulate major continuous distributions in Matlab and R

Distribution      Matlab code                   R code
Unif[A, B]        random('unif',A,B,[n,1])      runif(n,A,B)
N(μ, σ)           random('norm',μ,σ,[n,1])      rnorm(n,μ,σ)
Gamma(α, β)       random('gam',α,β,[n,1])       rgamma(n,α,1/β)
Exponential(λ)    random('exp',1/λ,[n,1])       rexp(n,λ)
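The rate-versus-scale distinction in R can be seen directly. The sketch below draws gamma variates twice from the same seed, once passing 1/β as the (positional) rate argument and once naming the scale argument. The parameter values are arbitrary illustrative choices.

```r
# Gamma(alpha, beta) in R: rate = 1/beta versus scale = beta
alpha <- 2
beta <- 4

set.seed(42)
x_rate <- rgamma(5, alpha, 1 / beta)        # third argument is the rate
set.seed(42)
x_scale <- rgamma(5, alpha, scale = beta)   # scale named explicitly

same_draws <- isTRUE(all.equal(x_rate, x_scale))
print(same_draws)

# Exponential with mean 100: R's rexp takes the rate lambda = 1/100
set.seed(42)
x_exp <- rexp(5, rate = 1 / 100)
print(x_exp)
```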


3.8.4 Precision of Simulation Results 


Sect. 2.8 discusses in detail the precision of estimates associated with simulating
discrete random variables. The same results apply in the continuous case. In
particular, the estimated standard error in using a sample proportion $\hat{p}$ to estimate
the true probability of an event is still $\sqrt{\hat{p}(1 - \hat{p})/n}$, where n is the simulation size.
Also, the estimated standard error in using a sample mean, $\bar{x}$, to estimate the true
expected value $\mu$ of a (continuous) rv X is $s/\sqrt{n}$, where s is the sample standard
deviation of the simulated values of X. Refer back to Sect. 2.8 for more details.
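Both standard errors are one-line computations once the simulated values are in hand. The R sketch below applies them to simulated exponential lifetimes with mean 100 h, as in Example 3.43. The event X > 150 and the variable names are illustrative assumptions.

```r
# Estimated standard errors for simulation-based estimates
set.seed(1)
n <- 10000
x <- rexp(n, rate = 1 / 100)             # simulated lifetimes, mean 100 h

# Sample proportion estimating P(X > 150), with its standard error
p_hat <- mean(x > 150)
se_p <- sqrt(p_hat * (1 - p_hat) / n)

# Sample mean estimating E(X), with its standard error
x_bar <- mean(x)
se_mean <- sd(x) / sqrt(n)

print(c(p_hat, se_p))
print(c(x_bar, se_mean))
```

For comparison, the exact values here are P(X > 150) = e^{−1.5} and E(X) = 100, so each estimate should fall within a few standard errors of its target.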


3.8.5 Exercises: Section 3.8 (129-139) 


129. The amount of time (hours) required to complete an unusually short
     statistics homework assignment is modeled by the pdf f(x) = x/2 for 0 < x < 2
     (and = 0 otherwise).
     (a) Obtain the cdf and then its inverse.
     (b) Write a program to simulate 10,000 values from this distribution.
     (c) Compare the sample mean and standard deviation of your 10,000
         simulated values to the theoretical mean and sd of this distribution
         (which you can determine by calculating the appropriate integrals).

130. The Weibull distribution was introduced in Sect. 3.5.
     (a) Find the inverse cdf for the Weibull distribution.
     (b) Write a program to simulate n values from a Weibull distribution. Your
         program should have three inputs: the desired number of simulated values
         n and the two parameters $\alpha$ and $\beta$. It should have a single output: an n × 1
         vector of simulated values.
     (c) Use your program from part (b) to simulate 10,000 values from a Weibull
         (4, 6) distribution and estimate the mean of this distribution. The correct
         value of the mean is $6\Gamma(5/4) \approx 5.438$; how close is your sample mean?

131. Consider the pdf for the rv X = magnitude (in newtons) of a dynamic load on a
     bridge, given in Example 3.7:

         f(x) = …   (as given in Example 3.7)
                0   otherwise

     Write a program to simulate values from this distribution using the inverse cdf
     method.

132. In distributed computing, any given task is split into smaller sub-tasks which
     are handled by separate processors (which are then recombined by a multiplexer).
     Consider a distributed computing system with 4 processors, and
     suppose for one particular purpose that pdf of completion time for a particular
     sub-task (microseconds) on any one of the processors is given by

         f(x) = 20/…
                0   otherwise

     That is, the sub-task completion times $X_1, X_2, X_3, X_4$ of the four processors
     each have the above pdf.
     (a) Write a program to simulate the above pdf using the inverse cdf method.
     (b) The overall time to complete any task is the largest of the four sub-task
         completion times: if we call this variable Y, then $Y = \max(X_1, X_2, X_3, X_4)$.
         (We assume that the multiplexing time is negligible.) Use your program in
         part (a) to simulate 10,000 values of the rv Y. Create a histogram of the
         simulated values of Y, and also use your simulation to estimate both E(Y)
         and SD(Y).

133. Exercise 16 in Sect. 3.1 introduced the following model for wait times at street
     crossings:

     $$f(x; \theta, \tau) = \begin{cases} \dfrac{\theta}{\tau}\left(1 - \dfrac{x}{\tau}\right)^{\theta - 1} & 0 \le x < \tau \\ 0 & \text{otherwise} \end{cases}$$

     where $\theta > 0$ and $\tau > 0$ are the parameters of the model.
     (a) Write a function to simulate values from this distribution, implementing
         the inverse cdf method. Your function should have three inputs: the
         desired number of simulated values n and values for the two parameters
         $\theta$ and $\tau$.
     (b) Use your function in part (a) to simulate 10,000 values from this wait time
         distribution with $\theta = 4$ and $\tau = 80$. Estimate E(X) under these parameter
         settings. How close is your estimate to the correct value of 16?

134. Explain why the transformation $x = -\mu\ln(u)$ may be used to simulate values
     from an exponential distribution with mean $\mu$. (This expression is slightly
     simpler than the one established in this section.)

135. Recall the rv X = amount of gravel (in tons) sold by a construction supply
     company in a given week from Example 3.9, whose pdf is

     $$f(x) = \begin{cases} \dfrac{3}{2}(1 - x^2) & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

     Consider simulating values from this distribution using the accept-reject
     method with a Unif[0, 1] “candidate” distribution, i.e., g(x) = 1 for 0 ≤ x ≤ 1.
     (a) Find the smallest majorization constant c so that f(x)/g(x) ≤ c for all x in
         [0, 1].
     (b) Write a program to simulate values from this distribution.
     (c) On the average, how many candidate values must your program generate
         in order to create 10,000 “accepted” values?
     (d) Simulate 10,000 values from this distribution, and use these to estimate
         the mean $\mu$ of this distribution. How close is your sample mean to the true
         value of $\mu$ (which you can determine using the appropriate integral)?
     (e) The supply company’s management looks at quarterly data for X, i.e.,
         values $X_1, \ldots, X_{13}$ for 13 weeks (one quarter of a year). Of particular
         interest is the variable $M = \min(X_1, \ldots, X_{13})$, the least amount of gravel
         sold in one week during a quarter. Use your program in (b) to simulate the
         rv M, and use the results of at least 10,000 simulated values of M to
         estimate P(M < .1), the chance that the worst sales week in a quarter saw
         less than .1 tons of gravel sold. [Hint: Simulate each $X_i$ 10,000 times for
         i = 1, ..., 13, and then compute the minimum of each set of 13 values to
         create a value for M.]

136. The time required to complete a 3-h final exam is modeled by the following
     pdf:

         f(x) = …
                0   otherwise

     Consider simulating values from this distribution using the accept-reject
     method with a uniform “candidate” distribution on the interval [0, 3].
     (a) Find the smallest majorization constant c so that f(x)/g(x) ≤ c for all x in
         [0, 3]. [Hint: What is the pdf of the uniform distribution on [0, 3]?]
     (b) Write a program to simulate values from this distribution.
     (c) On the average, how many candidate values must your program generate
         in order to create 10,000 “accepted” values?
     (d) A professor has 20 students taking her class (lucky professor!). Assume
         her 20 students’ completion times on the final exam can be modeled as
         20 independent observations from the above pdf. The professor must stay
         at the final exam until all 20 students are finished (i.e., until the last
         student leaves). Use your program in (b) to simulate the rv L = time, in
         hours, that the professor sits proctoring her final exam to 20 students. Use
         your simulation to estimate P(L > 35/12), the probability she will have to
         stay into the last 5 min of the final exam period.

137. The half-normal distribution has the following pdf:

     $$f(x) = \begin{cases} \dfrac{2}{\sqrt{2\pi}} e^{-x^2/2} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

     This is the distribution of |Z|, where Z ~ N(0, 1); equivalently, it’s the pdf
     that arises by “folding” the standard normal distribution in half along its
     line of symmetry. Consider simulating values from this distribution using
     the accept-reject method with a candidate distribution $g(x) = e^{-x}$ for x > 0
     (i.e., an exponential pdf with $\lambda = 1$).
     (a) Find the inverse cdf corresponding to g(x). (This will allow us to simulate
         values from the candidate distribution.)
     (b) Find the smallest majorization constant c so that f(x)/g(x) ≤ c for all x ≥ 0.
         [Hint: Use calculus to determine where the ratio f(x)/g(x) is maximized.]
     (c) On the average, how many candidate values will be required to generate
         10,000 “accepted” values?
     (d) Write a program to construct 10,000 values from a half-normal
         distribution.

138. As discussed previously, the normal distribution cannot be simulated using
     the inverse cdf method. One possibility for simulating from a standard normal
     distribution is to employ the accept-reject method with candidate distribution

     $$g(x) = \frac{1}{\pi(1 + x^2)}, \quad -\infty < x < \infty$$

     (This is the Cauchy distribution.)
     (a) Find the cdf and inverse cdf corresponding to g(x). (This will allow us to
         simulate values from the candidate distribution.)
     (b) Find the smallest majorization constant c so that f(x)/g(x) ≤ c for all x,
         where f(x) is the standard normal pdf. [Hint: Use calculus to determine
         where the ratio f(x)/g(x) is maximized.]
     (c) On the average, how many candidate values will be required to generate
         10,000 “accepted” values?
     (d) Write a program to construct 10,000 values from a standard normal
         distribution.
(e) Suppose that you now wish to simulate from a N(μ, σ) distribution. How
would you modify your program in part (d)?

139. Explain why the majorization constant c in the accept-reject algorithm must
be ≥ 1. [Hint: If c < 1, then f(x) < g(x) for all x. Why is this bad?]
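The accept-reject exercises above all follow the same loop, which can be sketched generically. The following is a minimal illustration, not the text's own program: the function name `accept_reject` and the toy target f(x) = 6x(1 − x) on [0, 1] with a Uniform[0, 1] candidate are assumptions chosen for illustration, so that none of the exercise answers is given away. For this toy target the maximum of f(x)/g(x) is 1.5, so c = 1.5.

```python
import random


def accept_reject(f, g, sample_g, c, n, rng):
    """Draw n values from pdf f using candidate pdf g, assuming f(x) <= c * g(x)."""
    accepted = []
    tried = 0
    while len(accepted) < n:
        y = sample_g(rng)          # candidate value drawn from g
        u = rng.random()           # independent Uniform[0, 1) value
        tried += 1
        if u * c * g(y) <= f(y):   # accept with probability f(y) / (c * g(y))
            accepted.append(y)
    return accepted, tried


def f(x):
    # Toy target pdf (an assumption for illustration): 6x(1 - x) on [0, 1]
    return 6 * x * (1 - x) if 0 <= x <= 1 else 0.0


def g(x):
    # Candidate pdf: Uniform[0, 1]
    return 1.0


rng = random.Random(1)
values, tried = accept_reject(f, g, lambda r: r.random(), 1.5, 10_000, rng)
mean = sum(values) / len(values)
candidates_per_accept = tried / len(values)
print(mean, candidates_per_accept)
```

In this sketch the sample mean should land near .5 (the mean of the toy target), and the number of candidates per accepted value should land near the majorization constant 1.5.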


3.9 Supplementary Exercises (140-172)


140. An insurance company issues a policy covering losses up to 5 (in thousands of
dollars). The loss, X, follows a distribution with density function:


= x>1 
fxaqe 
0 x<il 


What is the expected value of the amount paid under the policy? 

141. Let X = the time it takes a read/write head to locate a desired record on a
computer disk memory device once the head has been positioned over the
correct track. If the disks rotate once every 25 msec, a reasonable assumption
is that X is uniformly distributed on the interval [0, 25].

(a) Compute P(10 ≤ X ≤ 20).

(b) Compute P(X ≥ 10).

(c) Obtain the cdf F(x).

(d) Compute E(X) and SD(X).

142. A 12-in. bar clamped at both ends is subjected to an increasing amount of
stress until it snaps. Let Y = the distance from the left end at which the break
occurs. Suppose Y has pdf

$$f(y) = \begin{cases} \dfrac{1}{24}\, y \left(1 - \dfrac{y}{12}\right) & 0 \le y \le 12 \\ 0 & \text{otherwise} \end{cases}$$

Compute the following:

(a) The cdf of Y, and graph it.

(b) P(Y ≤ 4), P(Y > 6), and P(4 ≤ Y ≤ 6).

(c) E(Y), E(Y²), and SD(Y).

(d) The probability that the break point occurs more than 2 in. from the 
expected break point. 

(e) The expected length of the shorter segment when the break occurs. 

143. Let X denote the time to failure (in years) of a hydraulic component. Suppose
the pdf of X is $f(x) = 32/(x + 4)^3$ for x > 0.

(a) Verify that f(x) is a legitimate pdf. 

(b) Determine the cdf. 

(c) Use the result of part (b) to calculate the probability that time to failure is 
between 2 and 5 years. 

(d) What is the expected time to failure?

(e) If the component has a salvage value equal to 100/(4 + x) when its time to
failure is x, what is the expected salvage value?

144. The completion time X for a task has cdf F(x) given by

$$F(x) = \begin{cases} 0 & x < 0 \\ \dfrac{x^3}{3} & 0 \le x < 1 \\ 1 - \dfrac{1}{2}\left(\dfrac{7}{3} - x\right)\left(\dfrac{7}{4} - \dfrac{3}{4}x\right) & 1 \le x \le \dfrac{7}{3} \\ 1 & x > \dfrac{7}{3} \end{cases}$$


(a) Obtain the pdf f(x) and sketch its graph. 

(b) Compute P(.5 ≤ X ≤ 2).

(c) Compute E(X). 

145. The breakdown voltage of a randomly chosen diode of a certain type is known
to be normally distributed with mean value 40 V and standard deviation 1.5 V.

(a) What is the probability that the voltage of a single diode is between 39 and 
42? 

(b) What value is such that only 15% of all diodes have voltages exceeding 
that value? 

(c) If four diodes are independently selected, what is the probability that at 
least one has a voltage exceeding 42? 

146. The article “Computer Assisted Net Weight Control” (Qual. Prog., 1983: 22-25)
suggests a normal distribution with mean 137.2 oz and standard deviation
1.6 oz, for the actual contents of jars of a certain type. The stated contents was
135 oz.

(a) What is the probability that a single jar contains more than the stated 
contents? 

(b) Among ten randomly selected jars, what is the probability that at least 
eight contain more than the stated contents? 

(c) Assuming that the mean remains at 137.2, to what value would the 
standard deviation have to be changed so that 95% of all jars contain 
more than the stated contents? 

147. When circuit boards used in the manufacture of compact disk players are
tested, the long-run percentage of defectives is 5%. Suppose that a batch of
250 boards has been received and that the condition of any particular board is
independent of that of any other board.

(a) What is the approximate probability that at least 10% of the boards in the 
batch are defective? 

(b) What is the approximate probability that there are exactly 10 defectives in
the batch?

148. The article “Reliability of Domestic-Waste Biofilm Reactors” (J. Envir.
Engr., 1995: 785-790) suggests that substrate concentration (mg/cm³) of
influent to a reactor is normally distributed with μ = .30 and σ = .06.

(a) What is the probability that the concentration exceeds .25? 

(b) What is the probability that the concentration is at most .10? 

(c) How would you characterize the largest 5% of all concentration values? 

149. Let X = the hourly median power (in decibels) of received radio signals
transmitted between two cities. The authors of the article “Families of
Distributions for Hourly Median Power and Instantaneous Power of Received
Radio Signals” (J. Res. Nat. Bureau Standards, vol. 67D, 1963: 753-762)
argue that the lognormal distribution provides a reasonable probability model
for X. If the parameter values are μ = 3.5 and σ = 1.2, calculate the following:

(a) The mean value and standard deviation of received power. 

(b) The probability that received power is between 50 and 250 dB. 

(c) The probability that X is less than its mean value. Why is this probability 
not .5? 

150. Let X be a nonnegative continuous random variable with cdf F(x) and mean
E(X).

(a) The definition of expected value is $E(X) = \int_0^\infty x f(x)\,dx$. Replace the first
x inside the integral with $\int_0^x 1\,dy$ to create a double integral expression for
E(X). [The “order of integration” should be dy dx.]

(b) Rearrange the order of integration, keeping track of the revised limits of
integration, to show that

$$E(X) = \int_0^\infty \int_y^\infty f(x)\,dx\,dy$$

(c) Evaluate the dx integral in (b) to show that $E(X) = \int_0^\infty [1 - F(y)]\,dy$. (This
provides an alternate derivation of the formula established in Exercise 38.)

(d) Use the result of (c) to verify that the expected value of an exponentially
distributed rv with parameter λ is 1/λ.
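The identity in part (c) can also be checked numerically. The sketch below is only an illustration under stated assumptions: the helper name `tail_integral`, the rate value 2.0, the truncation point 40, and the midpoint rule are all choices made here, not part of the exercise. It integrates 1 − F(y) for an exponential rv and compares the result with the known mean.

```python
import math


def tail_integral(survival, upper, steps):
    """Midpoint-rule approximation of the integral of 1 - F(y) over [0, upper]."""
    h = upper / steps
    return sum(survival((i + 0.5) * h) for i in range(steps)) * h


lam = 2.0  # hypothetical rate parameter, chosen only for illustration


def survival(y):
    # 1 - F(y) for an exponential rv with parameter lam
    return math.exp(-lam * y)


# The tail beyond y = 40 contributes a negligible amount (about e^{-80}).
approx = tail_integral(survival, upper=40.0, steps=200_000)
print(approx, 1 / lam)
```

The two printed values should agree to several decimal places, which is what the tail formula predicts for this distribution.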

151. The reaction time (in seconds) to a stimulus is a continuous random variable
with pdf

$$f(x) = \begin{cases} \dfrac{3}{2} \cdot \dfrac{1}{x^2} & 1 \le x \le 3 \\ 0 & \text{otherwise} \end{cases}$$


(a) Obtain the cdf. 

(b) What is the probability that reaction time is at most 2.5 s? Between 1.5 
and 2.5 s? 

(c) Compute the expected reaction time.

(d) Compute the standard deviation of reaction time.

(e) If an individual takes more than 1.5 s to react, a light comes on and stays 
on either until one further second has elapsed or until the person reacts 
(whichever happens first). Determine the expected amount of time that the 
light remains lit. [Hint: Let h(X)=the time that the light is on as a 
function of reaction time X.] 

152. The article “Characterization of Room Temperature Damping in Aluminum-Indium
Alloys” (Metallurgical Trans., 1993: 1611-1619) suggests that aluminum
matrix grain size (μm) for an alloy consisting of 2% indium could be
modeled with a normal distribution with mean 96 and standard deviation 14.

(a) What is the probability that grain size exceeds 100 μm?

(b) What is the probability that grain size is between 50 and 80 μm?

(c) What interval (a, b) includes the central 90% of all grain sizes (so that 5% 
are below a and 5% are above b)? 

153. The article “Determination of the MTF of Positive Photoresists Using the
Monte Carlo Method” (Photographic Sci. Engrg., 1983: 254-260) proposes
the exponential distribution with parameter λ = .93 as a model for the
distribution of a photon’s free path length (mm) under certain circumstances.
Suppose this is the correct model.

(a) What is the expected path length, and what is the standard deviation of 
path length? 

(b) What is the probability that path length exceeds 3.0? What is the proba- 
bility that path length is between 1.0 and 3.0? 

(c) What value is exceeded by only 10% of all path lengths? 

154. The article “The Prediction of Corrosion by Statistical Analysis of Corrosion
Profiles” (Corrosion Sci., 1985: 305-315) suggests the following cdf for the
depth X of the deepest pit in an experiment involving the exposure of carbon
manganese steel to acidified seawater:

$$F(x; \theta_1, \theta_2) = e^{-e^{-(x - \theta_1)/\theta_2}} \qquad -\infty < x < \infty$$

(This is called the largest extreme value distribution or Gumbel distribution.)
The investigators proposed the values θ₁ = 150 and θ₂ = 90. Assume
this to be the correct model.

(a) What is the probability that the depth of the deepest pit is at most 150? At 
most 300? Between 150 and 300? 

(b) Below what value will the depth of the maximum pit be observed in 90% 
of all such experiments? 

(c) What is the density function of X? 

(d) The density function can be shown to be unimodal (a single peak). Above 
what value on the measurement axis does this peak occur? (This value is 
the mode.) 

(e) It can be shown that E(X) = .5772θ₂ + θ₁. What is the mean for the given
values of θ₁ and θ₂, and how does it compare to the median and mode?
Sketch the graph of the density function.

155. Let t = the amount of sales tax a retailer owes the government for a certain
period. The article “Statistical Sampling in Tax Audits” (Statistics and the
Law, 2008: 320-343) proposes modeling the uncertainty in t by regarding it as
a normally distributed random variable with mean value μ and standard
deviation σ (in the article, these two parameters are estimated from the results
of a tax audit involving n sampled transactions). If a represents the amount the
retailer is assessed, then an underassessment results if t > a and an
overassessment if a > t. We can express this in terms of a loss function, a function
that shows zero loss if t = a but increases as the gap between t and a increases.
The proposed loss function is L(a, t) = t − a if t > a and = k(a − t) if t ≤ a
(k > 1 is suggested to incorporate the idea that overassessment is more
serious than underassessment).

(a) Show that a* = μ + σΦ⁻¹(1/(k + 1)) is the value of a that minimizes the
expected loss, where Φ⁻¹ is the inverse function of the standard
normal cdf.

(b) If k = 2 (suggested in the article), μ = $100,000, and σ = $10,000, what is
the optimal value of a, and what is the resulting probability of
overassessment?

156. A mode of a continuous distribution is a value x* that maximizes f(x).

(a) What is the mode of a normal distribution with parameters μ and σ?

(b) Does the uniform distribution with parameters A and B have a single
mode? Why or why not?

(c) What is the mode of an exponential distribution with parameter λ? (Draw
a picture.)

(d) If X has a gamma distribution with parameters α and β, and α > 1, find the
mode. [Hint: ln[f(x)] will be maximized if and only if f(x) is, and it may be
simpler to take the derivative of ln[f(x)].]

157. The article “Error Distribution in Navigation” (J. Institut. Navigation, 1971:
429-442) suggests that the frequency distribution of positive errors
(magnitudes of errors) is well approximated by an exponential distribution.
Let X = the lateral position error (nautical miles), which can be either negative
or positive. Suppose the pdf of X is

$$f(x) = (.1)e^{-.2|x|} \qquad -\infty < x < \infty$$

(a) Sketch a graph of f(x) and verify that f(x) is a legitimate pdf (show that it
integrates to 1).

(b) Obtain the cdf of X and sketch it.

(c) Compute P(X ≤ 0), P(X ≤ 2), P(−1 ≤ X ≤ 2), and the probability that an
error of more than 2 miles is made.

158. The article “Statistical Behavior Modeling for Driver-Adaptive Precrash
Systems” (IEEE Trans. on Intelligent Transp. Systems, 2013: 1-9) proposed
the following distribution for modeling the behavior of what the authors called
“the criticality level of a situation” X.

$$f(x; \lambda_1, \lambda_2, p) = \begin{cases} p\lambda_1 e^{-\lambda_1 x} + (1 - p)\lambda_2 e^{-\lambda_2 x} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

This is often called the hyperexponential or mixed exponential distribution.

(a) What is the cdf F(x; λ₁, λ₂, p)?

(b) If p = .5, λ₁ = 40, λ₂ = 200 (values of the λs suggested in the cited article),
calculate P(X > .01).

(c) If X has f(x; λ₁, λ₂, p) as its pdf, what is E(X)?

(d) Using the fact that E(X²) = 2/λ² when X has an exponential distribution
with parameter λ, compute E(X²) when X has pdf f(x; λ₁, λ₂, p). Then
compute Var(X).

(e) The coefficient of variation of a random variable (or distribution) is
CV = σ/μ. What is the CV for an exponential rv? What can you say
about the value of CV when X has a hyperexponential distribution?

(f) What is the CV for an Erlang distribution with parameters λ and n as
defined in Sect. 3.4? [Note: In applied work, the sample CV is used to
decide which of the three distributions might be appropriate.]

(g) For the parameter values given in (b), calculate the probability that X is
within one standard deviation of its mean value. Does this probability
depend upon the values of the λs (it does not depend on λ when X has an
exponential distribution)?

159. Suppose a state allows individuals filing tax returns to itemize deductions only
if the total of all itemized deductions is at least $5,000. Let X (in 1000s of
dollars) be the total of itemized deductions on a randomly chosen form.
Assume that X has the pdf

$$f(x; \alpha) = \begin{cases} k/x^{\alpha} & x \ge 5 \\ 0 & \text{otherwise} \end{cases}$$

(a) Find the value of k. What restriction on α is necessary?

(b) What is the cdf of X?

(c) What is the expected total deduction on a randomly chosen form? What
restriction on α is necessary for E(X) to be finite?

(d) Show that ln(X/5) has an exponential distribution with parameter α − 1.

160. Let Iᵢ be the input current to a transistor and Iₒ be the output current. Then the
current gain is proportional to ln(Iₒ/Iᵢ). Suppose the constant of proportionality
is 1 (which amounts to choosing a particular unit of measurement), so that
current gain = X = ln(Iₒ/Iᵢ). Assume X is normally distributed with μ = 1 and
σ = .05.

(a) What type of distribution does the ratio Iₒ/Iᵢ have?

(b) What is the probability that the output current is more than twice the input
current?

(c) What are the expected value and variance of the ratio of output to input
current?

161. The article “Response of SiCf/Si3N4 Composites Under Static and Cyclic
Loading—An Experimental and Statistical Analysis” (J. Engr. Materials
Tech., 1997: 186-193) suggests that tensile strength (MPa) of composites
under specified conditions can be modeled by a Weibull distribution with
α = 9 and β = 180.

(a) Sketch a graph of the density function. 

(b) What is the probability that the strength of a randomly selected specimen 
will exceed 175? Will be between 150 and 175? 

(c) If two randomly selected specimens are chosen and their strengths are 
independent of each other, what is the probability that at least one has 
strength between 150 and 175? 

(d) What strength value separates the weakest 10% of all specimens from the 
remaining 90%? 

162. (a) Suppose the lifetime X of a component, when measured in hours, has a
gamma distribution with parameters α and β. Let Y = lifetime measured in
minutes. Derive the pdf of Y.

(b) If X has a gamma distribution with parameters α and β, what is the
probability distribution of Y = cX?

163. Based on data from a dart-throwing experiment, the article “Shooting Darts”
(Chance, Summer 1997: 16-19) proposed that the horizontal and vertical
errors from aiming at a point target should be independent of each other,
each with a normal distribution having mean 0 and standard deviation σ. It can
then be shown that the pdf of the distance V from the target to the landing
point is

$$f(v) = \frac{v}{\sigma^2}\, e^{-v^2/(2\sigma^2)} \qquad v > 0$$

(a) This pdf is a member of what family introduced in this chapter?

(b) If σ = 20 mm (close to the value suggested in the paper), what is the
probability that a dart will land within 25 mm (roughly 1 in.) of the target?

164. The article “Three Sisters Give Birth on the Same Day” (Chance, Spring
2001: 23-25) used the fact that three Utah sisters had all given birth on March
11, 1998, as a basis for posing some interesting questions regarding birth
coincidences.

(a) Disregarding leap year and assuming that the other 365 days are equally 
likely, what is the probability that three randomly selected births all occur 
on March 11? Be sure to indicate what, if any, extra assumptions you are 
making. 

(b) With the assumptions used in part (a), what is the probability that three 
randomly selected births all occur on the same day? 

(c) The author suggested that, based on extensive data, the length of gestation 
(time between conception and birth) could be modeled as having a normal 
distribution with mean value 280 days and standard deviation 19.88 days. 
The due dates for the three Utah sisters were March 15, April 1, and April 
4, respectively. Assuming that all three due dates are at the mean of the 
distribution, what is the probability that all births occurred on March 11?
[Hint: The deviation of birth date from due date is normally distributed
with mean 0.]

(d) Explain how you would use the information in part (c) to calculate the
probability of a common birth date.

165. Exercise 49 introduced two machines that produce wine corks, the first one
having a normal diameter distribution with mean value 3 cm and standard
deviation .1 cm and the second having a normal diameter distribution with
mean value 3.04 cm and standard deviation .02 cm. Acceptable corks have
diameters between 2.9 and 3.1 cm. If 60% of all corks used come from the first
machine and a randomly selected cork is found to be acceptable, what is the
probability that it was produced by the first machine?

166. A function g(x) is convex if the chord connecting any two points on the
function’s graph lies above the graph. When g(x) is differentiable, an equivalent
condition is that for every x, the tangent line at x lies entirely on or below the
graph. (See the figure below.) How does g(μ) = g[E(X)] compare to the
expected value E[g(X)]? [Hint: The equation of the tangent line at x = μ is
y = g(μ) + g′(μ) · (x − μ). Use the condition of convexity, substitute X for x, and
take expected values. Note: Unless g(x) is linear, the resulting inequality
(usually called Jensen’s inequality) is strict (< rather than ≤); it is valid for
both continuous and discrete rvs.]

[Figure: graph of a convex function, showing a chord lying above the curve and a tangent line lying below it.]


167. Let X have a Weibull distribution with parameters α = 2 and β. Show that
Y = 2X²/β² has an exponential distribution with λ = 1/2.

168. Let X have the pdf f(x) = 1/[π(1 + x²)] for −∞ < x < ∞ (a central Cauchy
distribution), and show that Y = 1/X has the same distribution. [Hint: Consider
P(|Y| ≤ y), the cdf of |Y|, then obtain its pdf and show it is identical to the pdf of
|X|.]

169. Let X have a Weibull distribution with shape parameter α and scale parameter
β. Show that the transformed variable Y = ln(X) has an extreme value distribution
as defined in Section 3.6, with θ₁ = ln(β) and θ₂ = 1/α.

170. A store will order q gallons of a liquid product to meet demand during a
particular time period. This product can be dispensed to customers in any
amount desired, so demand during the period is a continuous random variable
X with cdf F(x). There is a fixed cost c₀ for ordering the product plus a cost of
c₁ per gallon purchased. The per-gallon sale price of the product is d. Liquid
left unsold at the end of the time period has a salvage value of e per gallon.
Finally, if demand exceeds q, there will be a shortage cost for loss of goodwill
and future business; this cost is f per gallon of unfulfilled demand. Show that
the value of q that maximizes expected profit, denoted by q*, satisfies

$$P(\text{satisfying demand}) = F(q^*) = \frac{d - c_1 + f}{d - e + f}$$

Then determine the value of F(q*) if d = $35, c₀ = $25, c₁ = $15, e = $5,
and f = $25. [Hint: Let x denote a particular value of X. Develop an expression
for profit when x ≤ q and another expression for profit when x > q. Now write
an integral expression for expected profit (as a function of q) and
differentiate.]

171. An individual’s credit score is a number calculated based on that person’s
credit history that helps a lender determine how much s/he should be loaned or
what credit limit should be established for a credit card. An article in the Los
Angeles Times gave data which suggested that a beta distribution with
parameters A = 150, B = 850, α = 8, β = 2 would provide a reasonable
approximation to the distribution of American credit scores. [Note: credit
scores are integer-valued.]

(a) Let X represent a randomly selected American credit score. What are the
mean and standard deviation of this random variable? What is the probability
that X is within 1 standard deviation of its mean?

(b) What is the approximate probability that a randomly selected score will 
exceed 750 (which lenders consider a very good score)? 

172. Let V denote rainfall volume and W denote runoff volume (both in mm).
According to the article “Runoff Quality Analysis of Urban Catchments with
Analytical Probability Models” (J. of Water Resource Planning and Management,
2006: 4-14), the runoff volume will be 0 if V ≤ v_d and will be k(V − v_d)
if V > v_d. Here v_d is the volume of depression storage (a constant), and k (also
a constant) is the runoff coefficient. The cited article proposes an exponential
distribution with parameter λ for V.

(a) Obtain an expression for the cdf of W. [Note: W is neither purely continu- 
ous nor purely discrete; instead it has a “mixed” distribution with a 
discrete component at 0 and is continuous for values w > 0.] 

(b) What is the pdf of W for w > 0? Use this to obtain an expression for the 
expected value of runoff volume. 


4 Joint Probability Distributions and Their Applications


In Chaps. 2 and 3, we studied probability models for a single random variable. 
Many problems in probability and statistics lead to models involving several 
random variables simultaneously. For example, we might consider randomly 
selecting a college student and defining X =the student’s high school GPA and 
Y =the student’s college GPA. In this chapter, we first discuss probability models 
for the joint behavior of several random variables, putting special emphasis on the 
case in which the variables are independent of each other. We then study expected 
values of functions of several random variables, including covariance and correla- 
tion as measures of the degree of association between two variables. 

Many problem scenarios involve linear combinations of random variables. For
example, suppose an investor owns 100 shares of one stock and 200 shares of another.
If X₁ and X₂ are the prices per share of the two stocks, then the value of the investor’s
portfolio is 100X₁ + 200X₂. Sections 4.3 and 4.5 enumerate the properties of linear
combinations of random variables, including the celebrated Central Limit Theorem
(CLT), which characterizes the behavior of a sum X₁ + X₂ + ... + Xₙ as n increases.

The fourth section considers conditional distributions, the distributions of some
random variables given the values of other random variables, e.g., the distribution
of fuel efficiency conditional on the weight of a vehicle.

In Sect. 3.7, we developed methods for obtaining the distribution of some 
function g(X) of a random variable. Section 4.6 extends these ideas to 
transformations of two or more rvs. For example, if X and Y are the scores on a 
two-part exam, we might be interested in the total score X+ Y and also X/(X +Y), 
the proportion of total points achieved on the first part. 

The chapter ends with sections on the bivariate normal distribution (Sect. 4.7),
the reliability of devices and systems (Sect. 4.8), “order statistics” such as the
median and range obtained by ordering sample observations from smallest to
largest (Sect. 4.9), and simulation techniques for jointly distributed random
variables (Sect. 4.10).


4.1 Jointly Distributed Random Variables


There are many experimental situations in which more than one random variable
(rv) will be of interest to an investigator. For example, X might be the number of
books checked out from a public library on a particular day and Y the number of
videos checked out on the same day. Or X and Y might be the height and weight, 
respectively, of a randomly selected adult. In general, the two rvs of interest could 
both be discrete, both be continuous, or one could be discrete and the other 
continuous. In practice, the two “pure” cases—both of the same type—predomi- 
nate. We shall first consider joint probability distributions for two discrete rvs, then 
for two continuous variables, and finally for more than two variables. 


4.1.1 The Joint Probability Mass Function for Two Discrete 
Random Variables 


The probability mass function (pmf) of a single discrete rv X specifies how much 
probability mass is placed on each possible X value. The joint pmf of two discrete 
rvs X and Y describes how much probability mass is placed on each possible pair of 
values (x, y). 


DEFINITION 

Let X and Y be two discrete rvs defined on the sample space 𝒮 of an experiment.
The joint probability mass function p(x, y) is defined for each pair of
numbers (x, y) by

$$p(x, y) = P(X = x \text{ and } Y = y)$$

A function p(x, y) can be used as a joint pmf provided that p(x, y) ≥ 0 for all x and
y and $\sum_x \sum_y p(x, y) = 1$. Let A be any set consisting of pairs of (x, y) values, such as
{(x, y): x + y < 10}. Then the probability that the random pair (X, Y) lies in A is
obtained by summing the joint pmf over pairs in A:

$$P((X, Y) \in A) = \sum_{(x, y) \in A} p(x, y)$$


Example 4.1 A large insurance agency services a number of customers who have 
purchased both a homeowner’s policy and an automobile policy from the agency. 
For each type of policy, a deductible amount must be specified. For an automobile 
policy, the choices are $100 and $250, whereas for a homeowner’s policy, the 
choices are 0, $100, and $200. Suppose an individual with both types of policy is 
selected at random from the agency’s files. Let X = the deductible amount on the
auto policy and Y = the deductible amount on the homeowner’s policy. Possible
(X, Y) pairs are then (100, 0), (100, 100), (100, 200), (250, 0), (250, 100), and
(250, 200); the joint pmf specifies the probability associated with each one of these
pairs, with any other pair having probability zero. Suppose the joint pmf is as given
in the accompanying joint probability table:


                      y
p(x, y)        0      100     200
x    100      .20     .10     .20
     250      .05     .15     .30


Then p(100, 100) = P(X = 100 and Y = 100) = P($100 deductible on both
policies) = .10. The probability P(Y ≥ 100) is computed by summing probabilities
of all (x, y) pairs for which y ≥ 100:


P(Y ≥ 100) = p(100, 100) + p(250, 100) + p(100, 200) + p(250, 200) = .75
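The summation over pairs in an event A is easy to mirror in code. The following is a small sketch, assuming the joint pmf of Example 4.1 is stored as a dictionary keyed by (x, y); the names `joint_pmf` and `prob` are illustrative choices made here rather than anything defined in the text.

```python
# Joint pmf of Example 4.1: keys are (x, y) = (auto, homeowner) deductibles.
joint_pmf = {
    (100, 0): 0.20, (100, 100): 0.10, (100, 200): 0.20,
    (250, 0): 0.05, (250, 100): 0.15, (250, 200): 0.30,
}


def prob(event, pmf):
    """P((X, Y) in A), where the set A is described by the predicate event(x, y)."""
    return sum(p for (x, y), p in pmf.items() if event(x, y))


total = prob(lambda x, y: True, joint_pmf)            # a legitimate pmf sums to 1
p_y_ge_100 = prob(lambda x, y: y >= 100, joint_pmf)   # P(Y >= 100)
print(round(total, 10), round(p_y_ge_100, 10))
```

Describing A by a predicate keeps the function general: any other event, such as x + y < 300, only requires a different lambda.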


It should be obvious from the preceding example that a probability such as
P(Y = 0), i.e., $p_Y(0)$, results from summing p(x, 0) over all possible x values. More
generally the pmf of Y is obtained by fixing the value of y in turn at each possible
value and summing p(x, y) over all values of x. The pmf of X can be obtained by
analogous summation. The result is called a marginal pmf, because when the p(x, y)
values appear in a rectangular table, the sums are just marginal (row or column)
totals.


DEFINITION
The marginal probability mass functions of X and of Y, denoted by $p_X(x)$
and $p_Y(y)$, respectively, are given by

$$p_X(x) = \sum_y p(x, y) \qquad p_Y(y) = \sum_x p(x, y)$$


Thus to obtain the marginal pmf of X evaluated at, say, x = 100, the probabilities 
p(100, y) are added over all possible y values. Doing this for each possible X value 
gives the marginal pmf of X alone (i.e., without reference to Y). From the marginal 
pmfs, probabilities of events involving only X or only Y can be computed. 


Example 4.2 (Example 4.1 continued) The possible X values are x = 100 and
x = 250, so computing row totals in the joint probability table yields

$$p_X(100) = p(100, 0) + p(100, 100) + p(100, 200) = .50$$

$$p_X(250) = p(250, 0) + p(250, 100) + p(250, 200) = .50$$

The marginal pmf of X is then

$$p_X(x) = \begin{cases} .5 & x = 100, 250 \\ 0 & \text{otherwise} \end{cases}$$

Similarly, the marginal pmf of Y is obtained from column totals as

$$p_Y(y) = \begin{cases} .25 & y = 0, 100 \\ .50 & y = 200 \\ 0 & \text{otherwise} \end{cases}$$

so $P(Y \ge 100) = p_Y(100) + p_Y(200) = .75$ as before. ■
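The row and column totals can be reproduced with a few lines of code. This is a minimal sketch that re-enters the joint probability table of Example 4.1; the function name `marginals` is an illustrative assumption, not notation from the text.

```python
from collections import defaultdict

# Joint pmf of Example 4.1, keyed by (x, y).
joint_pmf = {
    (100, 0): 0.20, (100, 100): 0.10, (100, 200): 0.20,
    (250, 0): 0.05, (250, 100): 0.15, (250, 200): 0.30,
}


def marginals(pmf):
    """Return (p_X, p_Y) by summing the joint pmf over the other variable."""
    p_x = defaultdict(float)
    p_y = defaultdict(float)
    for (x, y), p in pmf.items():
        p_x[x] += p   # row total: sum over y for this x
        p_y[y] += p   # column total: sum over x for this y
    return dict(p_x), dict(p_y)


p_x, p_y = marginals(joint_pmf)
print({x: round(p, 10) for x, p in p_x.items()})
print({y: round(p, 10) for y, p in p_y.items()})
```

Adding the entries of `p_y` for y = 100 and y = 200 recovers the probability .75 obtained earlier directly from the joint pmf.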


4.1.2. The Joint Probability Density Function for Two Continuous 
Random Variables 


The probability that the observed value of a continuous rv X lies in a 
one-dimensional set A (such as an interval) is obtained by integrating the pdf f(x) 
over the set A. Similarly, the probability that the pair (X, Y) of continuous rvs falls 
in a two-dimensional set A (such as a rectangle) is obtained by integrating a 
function called the joint density function. 


DEFINITION 
Let X and Y be continuous rvs. Then f(x, y) is the joint probability density
function for X and Y if for any two-dimensional set A,

$$P[(X, Y) \in A] = \iint_A f(x, y)\,dx\,dy$$

In particular, if A is the two-dimensional rectangle {(x, y): a ≤ x ≤ b,
c ≤ y ≤ d}, then

$$P[(X, Y) \in A] = P(a \le X \le b,\ c \le Y \le d) = \int_a^b \int_c^d f(x, y)\,dy\,dx$$


Fig. 4.1 P[(X, Y) ∈ A] = volume under density surface above A (the figure shows the surface f(x, y) over a shaded rectangle A)

For f(x, y) to be a candidate for a joint pdf, it must satisfy f(x, y) ≥ 0 and
$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$. We can think of f(x, y) as specifying a surface at height
f(x, y) above the point (x, y) in a three-dimensional coordinate system. Then
P[(X, Y) ∈ A] is the volume underneath this surface and above the region A,
analogous to the area under a curve in the one-dimensional case. This is illustrated
in Fig. 4.1.


Example 4.3 A bank operates both a drive-up facility and a walk-up window. On a
randomly selected day, let X = the proportion of time that the drive-up facility is in
use (at least one customer is being served or waiting to be served) and Y = the
proportion of time that the walk-up window is in use. Then the set of possible values
for (X, Y) is the rectangle {(x, y): 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}. Suppose the joint pdf of
(X, Y) is given by

$$f(x, y) = \begin{cases} \dfrac{6}{5}(x + y^2) & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$

To verify that this is a legitimate pdf, note that f(x, y) ≥ 0 and

$$\begin{aligned}
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy &= \int_0^1\int_0^1 \frac{6}{5}(x + y^2)\,dx\,dy = \int_0^1\int_0^1 \frac{6}{5}x\,dx\,dy + \int_0^1\int_0^1 \frac{6}{5}y^2\,dx\,dy \\
&= \int_0^1 \frac{6}{5}x\,dx + \int_0^1 \frac{6}{5}y^2\,dy = \frac{6}{10} + \frac{6}{15} = 1
\end{aligned}$$

The probability that neither facility is busy more than one-quarter of the time is

$$\begin{aligned}
P\left(0 \le X \le \tfrac{1}{4},\ 0 \le Y \le \tfrac{1}{4}\right) &= \int_0^{1/4}\int_0^{1/4} \frac{6}{5}(x + y^2)\,dx\,dy \\
&= \frac{6}{5}\int_0^{1/4}\int_0^{1/4} x\,dx\,dy + \frac{6}{5}\int_0^{1/4}\int_0^{1/4} y^2\,dx\,dy \\
&= \frac{6}{20}\cdot\frac{x^2}{2}\Big|_{x=0}^{x=1/4} + \frac{6}{20}\cdot\frac{y^3}{3}\Big|_{y=0}^{y=1/4} = \frac{7}{640} = .0109
\end{aligned}$$

■
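Both integrals in Example 4.3 can be sanity-checked numerically. The sketch below uses a plain midpoint rule on a grid; the helper name `double_midpoint` and the grid size of 400 are assumptions made here for illustration, and the result is an approximation rather than an exact evaluation.

```python
def double_midpoint(f, x_lo, x_hi, y_lo, y_hi, n=400):
    """Approximate the double integral of f over a rectangle with an n-by-n midpoint rule."""
    hx = (x_hi - x_lo) / n
    hy = (y_hi - y_lo) / n
    total = 0.0
    for i in range(n):
        x = x_lo + (i + 0.5) * hx
        for j in range(n):
            y = y_lo + (j + 0.5) * hy
            total += f(x, y)
    return total * hx * hy


def f(x, y):
    # Joint pdf of Example 4.3 on the unit square
    return 1.2 * (x + y * y)


total_mass = double_midpoint(f, 0.0, 1.0, 0.0, 1.0)      # should be close to 1
p_quarter = double_midpoint(f, 0.0, 0.25, 0.0, 0.25)     # should be close to 7/640
print(total_mass, p_quarter, 7 / 640)
```

The same helper applies to any rectangle inside the unit square, so other probabilities of the form P(a ≤ X ≤ b, c ≤ Y ≤ d) can be approximated by changing the limits.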


The marginal pmf of one discrete variable results from summing the joint pmf 
over all values of the other variable. Similarly, the marginal pdf of one continuous 
variable is obtained by integrating the joint pdf over all values of the other variable. 


DEFINITION
The marginal probability density functions of X and Y, denoted by f_X(x) and
f_Y(y), respectively, are given by

\[
f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy \quad \text{for } -\infty < x < \infty
\]
\[
f_Y(y) = \int_{-\infty}^{\infty} f(x,y)\,dx \quad \text{for } -\infty < y < \infty
\]


Example 4.4 (Example 4.3 continued) The marginal pdf of X, which gives the
probability distribution of busy time for the drive-up facility without reference to
the walk-up window, is

\[
f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy = \int_0^1 \frac{6}{5}(x+y^2)\,dy = \frac{6}{5}x + \frac{2}{5}
\]

for 0 ≤ x ≤ 1 and 0 otherwise. The marginal pdf of Y is

\[
f_Y(y) = \begin{cases} \dfrac{6}{5}y^2 + \dfrac{3}{5} & 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}
\]
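The integration behind the marginal pdf of Y is not written out in the example. A short worked version, using only the joint pdf of Example 4.3, might read as follows; the second line is a sanity check that the marginal integrates to 1.

\[
\begin{aligned}
f_Y(y) &= \int_0^1 \frac{6}{5}(x+y^2)\,dx = \frac{6}{5}\left[\frac{x^2}{2} + x y^2\right]_{x=0}^{x=1} = \frac{6}{5}y^2 + \frac{3}{5}, \qquad 0 \le y \le 1 \\
\int_0^1 f_Y(y)\,dy &= \frac{6}{5}\cdot\frac{1}{3} + \frac{3}{5} = \frac{2}{5} + \frac{3}{5} = 1
\end{aligned}
\]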


Then, for example,

\[
P\left(\tfrac14 \le Y \le \tfrac34\right) = \int_{1/4}^{3/4}\left(\frac{6}{5}y^2 + \frac{3}{5}\right)dy = \frac{37}{80} = .4625
\]

■


In Example 4.3, the region of positive joint density was a rectangle, which made 
computation of the marginal pdfs relatively easy. Consider now an example in 
which the region of positive density is a more complicated figure. 


Example 4.5 A nut company markets cans of deluxe mixed nuts containing 
almonds, cashews, and peanuts. Suppose the net weight of each can is exactly 
1 lb, but the weight contribution of each type of nut is random. Because the three 
weights sum to 1, a joint probability model for any two gives all necessary 
information about the weight of the third type. Let X = the weight of almonds in 
a selected can and Y = the weight of cashews. Then the region of positive density is 
D = {(x, y): 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, x + y ≤ 1}, the shaded region pictured in Fig. 4.2.
Now let the joint pdf for (X, Y) be

\[
f(x,y) = \begin{cases} 24xy & 0 \le x \le 1,\ 0 \le y \le 1,\ x + y \le 1 \\ 0 & \text{otherwise} \end{cases}
\]


For any fixed x, f(x, y) increases with y; for fixed y, f(x, y) increases with x. This is 
appropriate because the word deluxe implies that most of the can should consist of 
almonds and cashews rather than peanuts, so that the density function should be 
large near the upper boundary and small near the origin. The surface determined by 
f(x, y) slopes upward from zero as (x, y) moves away from either axis.

Clearly, f(x, y) ≥ 0. To verify the second condition on a joint pdf, recall that a
double integral is computed as an iterated integral by holding one variable fixed 
(such as x as in Fig. 4.2), integrating over values of the other variable lying along 
the straight line passing through the value of the fixed variable, and finally 
integrating over all possible values of the fixed variable. Thus 


\[
\begin{aligned}
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,dy\,dx
&= \iint_D 24xy\,dy\,dx = \int_0^1\left\{\int_0^{1-x} 24xy\,dy\right\}dx \\
&= \int_0^1 24x\left[\frac{y^2}{2}\right]_{y=0}^{y=1-x} dx = \int_0^1 12x(1-x)^2\,dx = 1
\end{aligned}
\]

[Fig. 4.2 Region of positive density for Example 4.5]

[Fig. 4.3 Computing P((X, Y) ∈ A) for Example 4.5; A = shaded region below the line y = .5 − x]


To compute the probability that the two types of nuts together make up at most
50% of the can, let A = {(x, y): 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and x + y ≤ .5}, as shown in
Fig. 4.3. Then

\[
P((X,Y) \in A) = \iint_A f(x,y)\,dx\,dy = \int_0^{.5}\!\int_0^{.5-x} 24xy\,dy\,dx = .0625
\]


The marginal pdf for almonds is obtained by holding X fixed at x (again, as in
Fig. 4.2) and integrating f(x, y) along the vertical line through x:

\[
f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy = \begin{cases} \displaystyle\int_0^{1-x} 24xy\,dy = 12x(1-x)^2 & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}
\]

By symmetry of f(x, y) and the region D, the marginal pdf of Y is obtained by
replacing x and X in f_X(x) by y and Y, respectively. ■
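When the region of positive density is not a rectangle, the inner limit of integration depends on the outer variable. The following sketch mirrors that structure for Example 4.5. It is an illustration under assumptions: Simpson's rule stands in for the hand integration, and the helper names `simpson`, `mass_below_line`, and `marginal_x` are hypothetical.

```python
def simpson(f, a, b, n=20):
    """Composite Simpson's rule for the integral of f over [a, b] (n must be even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def mass_below_line(c):
    """P(X + Y <= c) for 0 < c <= 1: integrate 24xy over 0 <= y <= c - x, 0 <= x <= c."""
    def inner(x):
        # For fixed x, y runs along the vertical segment from 0 up to the line y = c - x.
        return simpson(lambda y: 24 * x * y, 0.0, c - x)
    return simpson(inner, 0.0, c)


def marginal_x(x):
    """f_X(x) for 0 <= x <= 1: integrate 24xy along the vertical line through x."""
    return simpson(lambda y: 24 * x * y, 0.0, 1.0 - x)


total_mass = mass_below_line(1.0)   # integral over all of D, should be 1
p_half = mass_below_line(0.5)       # P(X + Y <= .5), compare with .0625
print(f"{total_mass:.4f} {p_half:.4f} {marginal_x(0.5):.4f}")
```

The value `marginal_x(0.5)` can be compared with the closed form 12x(1 − x)^2 evaluated at x = .5.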


4.1.3 Independent Random Variables


In many situations, information about the observed value of one of the two variables 
X and Y gives information about the value of the other variable. In Example 4.1, the 
marginal probability of X at x = 250 was .5, as was the probability that X = 100. If,
however, we are told that the selected individual had Y = 0, then X = 100 is four
times as likely as X = 250. Thus there is a dependence between the two variables.

In Chap. 1 we pointed out that one way of defining independence of two events is
to say that A and B are independent if P(A ∩ B) = P(A) · P(B). Here is an analogous
definition for the independence of two rvs.


DEFINITION 
Two random variables X and Y are said to be independent if for every pair of 
x and y values, 


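\[
\begin{aligned}
p(x,y) &= p_X(x)\cdot p_Y(y) && \text{when } X \text{ and } Y \text{ are discrete} \\
&\text{or} \\
f(x,y) &= f_X(x)\cdot f_Y(y) && \text{when } X \text{ and } Y \text{ are continuous}
\end{aligned}
\tag{4.1}
\]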


If Eq. (4.1) is not satisfied for all (x, y), then X and Y are said to be dependent. 


The definition says that two variables are independent if their joint pmf or pdf is 
the product of the two marginal pmfs or pdfs. 


Example 4.6 In the insurance situation of Examples 4.1 and 4.2,

\[
p(100, 100) = .10 \ne (.5)(.25) = p_X(100)\cdot p_Y(100)
\]

so X and Y are not independent. Independence of X and Y requires that every entry in
the joint probability table be the product of the corresponding row and column
marginal probabilities. ■


Example 4.7 (Example 4.5 continued) Because f(x, y) in the nut scenario has the
form of a product, X and Y would appear to be independent. However, although
f_X(3/4) = f_Y(3/4) = 9/16, f(3/4, 3/4) = 0 ≠ (9/16) · (9/16), so the variables are not in fact
independent. To be independent, f(x, y) must have the form g(x) · h(y) and the region of
positive density must be a rectangle whose sides are parallel to the coordinate axes.

■
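The failure of Eq. (4.1) at the point (3/4, 3/4) can be confirmed with exact rational arithmetic. This is a minimal sketch; the function names `joint_pdf` and `marginal_pdf` are hypothetical, and the marginal formula is the one derived in Example 4.5.

```python
from fractions import Fraction


def joint_pdf(x, y):
    """Joint pdf of Example 4.5: 24xy on the triangle x, y >= 0, x + y <= 1."""
    if x >= 0 and y >= 0 and x + y <= 1:
        return 24 * x * y
    return Fraction(0)


def marginal_pdf(t):
    """Common marginal pdf of X and Y: 12t(1 - t)^2 on [0, 1]."""
    if 0 <= t <= 1:
        return 12 * t * (1 - t) ** 2
    return Fraction(0)


x = y = Fraction(3, 4)
joint_value = joint_pdf(x, y)                            # the point lies outside the triangle
product_of_marginals = marginal_pdf(x) * marginal_pdf(y)
print(joint_value, product_of_marginals)
```

A single point at which the joint pdf differs from the product of the marginals is enough to rule out independence.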


Independence of two random variables most often arises when the description of 
the experiment under study tells us that X and Y have no effect on each other. Then 
once the marginal pmfs or pdfs have been specified, the joint pmf or pdf is simply 
the product of the two marginal functions. It follows that

\[
P(a \le X \le b,\ c \le Y \le d) = P(a \le X \le b)\cdot P(c \le Y \le d)
\]


Example 4.8 Suppose that the lifetimes of two components are independent of
each other and that the first lifetime, X_1, has an exponential distribution with
parameter λ_1, whereas the second, X_2, has an exponential distribution with parameter
λ_2. Then the joint pdf is

\[
f(x_1,x_2) = f_{X_1}(x_1)\cdot f_{X_2}(x_2)
= \begin{cases} \lambda_1 e^{-\lambda_1 x_1}\cdot\lambda_2 e^{-\lambda_2 x_2} = \lambda_1\lambda_2 e^{-\lambda_1 x_1 - \lambda_2 x_2} & x_1 > 0,\ x_2 > 0 \\ 0 & \text{otherwise} \end{cases}
\]


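Let λ_1 = 1/1000 and λ_2 = 1/1200, so that the expected lifetimes are 1000 h and
1200 h, respectively. The probability that both component lifetimes are at least
1500 h is

\[
\begin{aligned}
P(X_1 \ge 1500,\ X_2 \ge 1500) &= P(X_1 \ge 1500)\cdot P(X_2 \ge 1500) \\
&= \int_{1500}^{\infty}\lambda_1 e^{-\lambda_1 x_1}\,dx_1\cdot\int_{1500}^{\infty}\lambda_2 e^{-\lambda_2 x_2}\,dx_2 \\
&= e^{-\lambda_1(1500)}\cdot e^{-\lambda_2(1500)} = (.2231)(.2865) = .0639
\end{aligned}
\]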


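The probability that the sum of their lifetimes, X_1 + X_2, is at most 3000 h requires
a double integral of the joint pdf:

\[
\begin{aligned}
P(X_1 + X_2 \le 3000) &= P(X_1 \le 3000 - X_2) = \int_0^{3000}\!\int_0^{3000-x_2} f(x_1,x_2)\,dx_1\,dx_2 \\
&= \int_0^{3000}\!\int_0^{3000-x_2}\lambda_1\lambda_2 e^{-\lambda_1 x_1 - \lambda_2 x_2}\,dx_1\,dx_2
 = \int_0^{3000}\lambda_2 e^{-\lambda_2 x_2}\left[-e^{-\lambda_1 x_1}\right]_0^{3000-x_2}dx_2 \\
&= \int_0^{3000}\lambda_2 e^{-\lambda_2 x_2}\left[1 - e^{-\lambda_1(3000-x_2)}\right]dx_2 = .7564
\end{aligned}
\]

■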
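Both probabilities in Example 4.8 are easy to reproduce numerically. The sketch below evaluates the product of the two survival probabilities directly and applies Simpson's rule (an assumed numerical substitute for the final hand integration) to the last single integral of the example; the names `simpson`, `integrand`, `p_both_survive`, and `p_sum` are hypothetical.

```python
import math


def simpson(f, a, b, n=200):
    """Composite Simpson's rule for the integral of f over [a, b] (n must be even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


lam1, lam2 = 1 / 1000, 1 / 1200   # rates giving mean lifetimes of 1000 h and 1200 h

# Independence lets the joint survival probability factor into two exponential tails.
p_both_survive = math.exp(-lam1 * 1500) * math.exp(-lam2 * 1500)


def integrand(x2):
    """lam2 * exp(-lam2 * x2) * P(X1 <= 3000 - x2): the last integrand of the example."""
    return lam2 * math.exp(-lam2 * x2) * (1 - math.exp(-lam1 * (3000 - x2)))


p_sum = simpson(integrand, 0.0, 3000.0)
print(f"{p_both_survive:.4f} {p_sum:.4f}")
```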


4.1.4 More Than Two Random Variables 


To model the joint behavior of more than two random variables, we extend the
concept of a joint distribution of two variables.


DEFINITION
If X_1, X_2, ..., X_n are all discrete random variables, the joint pmf of the
variables is the function

\[
p(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)
\]

If the variables are continuous, the joint pdf of X_1, X_2, ..., X_n is the
function f(x_1, x_2, ..., x_n) such that for any n intervals [a_1, b_1], ..., [a_n, b_n],

\[
P(a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n) = \int_{a_1}^{b_1}\cdots\int_{a_n}^{b_n} f(x_1,\ldots,x_n)\,dx_n\cdots dx_1
\]

Example 4.9 A binomial experiment consists of n dichotomous (success-failure), 
homogenous (constant success probability) independent trials. Now consider a 
trinomial experiment in which each of the n trials can result in one of three possible 
outcomes. For example, each successive customer at a store might pay with cash, a 
credit card, or a debit card. The trials are assumed independent. Let p_1 = P(trial
results in a type 1 outcome) and define p_2 and p_3 analogously for type 2 and type
3 outcomes. The random variables of interest here are X_i = the number of trials that
result in a type i outcome for i = 1, 2, 3.

In n = 10 trials, the probability that the first five are type 1 outcomes, the next
three are type 2, and the last two are type 3 (i.e., the probability of the experimental
outcome 1111122233) is p_1^5 · p_2^3 · p_3^2. This is also the probability of the outcome
1122311123, and in fact the probability of any outcome that has exactly five 1s,
three 2s, and two 3s. Now to determine the probability P(X_1 = 5, X_2 = 3, and
X_3 = 2), we have to count the number of outcomes that have exactly five 1s, three
2s, and two 3s. First, there are \binom{10}{5} ways to choose five of the trials to be the type
1 outcomes. Now from the remaining five trials, we choose three to be the type
2 outcomes, which can be done in \binom{5}{3} ways. This determines the remaining two
trials, which consist of type 3 outcomes. So the total number of ways of choosing
five 1s, three 2s, and two 3s is

\[
\binom{10}{5}\cdot\binom{5}{3} = \frac{10!}{5!\,5!}\cdot\frac{5!}{3!\,2!} = \frac{10!}{5!\,3!\,2!} = 2520
\]

Thus we see that P(X_1 = 5, X_2 = 3, X_3 = 2) = 2520 p_1^5 · p_2^3 · p_3^2. Generalizing this
to n trials gives

\[
p(x_1,x_2,x_3) = P(X_1 = x_1, X_2 = x_2, X_3 = x_3) = \frac{n!}{x_1!\,x_2!\,x_3!}\,p_1^{x_1}p_2^{x_2}p_3^{x_3}
\]

for x_1 = 0, 1, 2, ...; x_2 = 0, 1, 2, ...; x_3 = 0, 1, 2, ... such that x_1 + x_2 + x_3 = n. Notice
that whereas there are three random variables here, the third variable X_3 is actually
redundant, because for example in the case n = 10, having X_1 = 5 and X_2 = 3
implies that X_3 = 2 (just as in a binomial experiment there are actually two rvs,
the number of successes and the number of failures, but the latter is redundant).

As an example, the genotype of a pea section can be either AA, Aa, or aa. A
simple genetic model specifies P(AA) = .25, P(Aa) = .50, and P(aa) = .25. If the
alleles of ten independently obtained sections are determined, the probability that
exactly five of these are Aa and two are AA is

\[
p(2,5,3) = \frac{10!}{2!\,5!\,3!}(.25)^2(.50)^5(.25)^3 = .0769
\]

■


The trinomial scenario of Example 4.9 can be generalized by considering a 
multinomial experiment consisting of n independent and identical trials, in which 
each trial can result in any one of r possible outcomes. Let p_i = P(outcome i on any
particular trial), and define random variables by X_i = the number of trials resulting
in outcome i (i = 1, ..., r). The joint pmf of X_1, ..., X_r is called the multinomial
distribution. An argument analogous to what was done in Example 4.9 gives the
joint pmf of X_1, ..., X_r:

\[
p(x_1,\ldots,x_r) = \begin{cases} \dfrac{n!}{(x_1!)(x_2!)\cdots(x_r!)}\,p_1^{x_1}\cdots p_r^{x_r} & \text{for } x_i = 0, 1, 2, \ldots \text{ with } x_1 + \cdots + x_r = n \\ 0 & \text{otherwise} \end{cases}
\]

The case r = 2 reduces to the binomial distribution, with X_1 = number of
successes and X_2 = n − X_1 = number of failures.
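The multinomial pmf translates almost directly into code. The sketch below uses only the Python standard library; the function name `multinomial_pmf` is hypothetical, and the number of trials n is taken to be the sum of the supplied counts.

```python
from math import comb, factorial, prod


def multinomial_pmf(counts, probs):
    """P(X_1 = counts[0], ..., X_r = counts[-1]) for n = sum(counts) independent trials."""
    coefficient = factorial(sum(counts))
    for x in counts:
        coefficient //= factorial(x)      # each division is exact: n! / (x_1! ... x_r!)
    return coefficient * prod(p ** x for p, x in zip(probs, counts))


# Pea-genotype illustration from Example 4.9: two AA, five Aa, three aa.
pea = multinomial_pmf((2, 5, 3), (0.25, 0.50, 0.25))

# With r = 2 categories the formula collapses to the binomial pmf.
binomial_case = multinomial_pmf((3, 7), (0.2, 0.8))
binomial_direct = comb(10, 3) * 0.2 ** 3 * 0.8 ** 7

print(f"{pea:.4f}")
```

Passing probabilities that are all 1, as in `multinomial_pmf((5, 3, 2), (1, 1, 1))`, returns just the counting coefficient, which can be compared with the 2520 obtained in Example 4.9.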


Example 4.10 When a certain method is used to collect a fixed volume of rock
samples in a region, there are four resulting rock types. Let X_1, X_2, and X_3 denote the
proportion by volume of rock types 1, 2, and 3 in a randomly selected sample (the
proportion of rock type 4 is 1 − X_1 − X_2 − X_3, so a variable X_4 would be redundant).
If the joint pdf of X_1, X_2, X_3 is

\[
f(x_1,x_2,x_3) = \begin{cases} k x_1 x_2 (1 - x_3) & 0 \le x_1 \le 1,\ 0 \le x_2 \le 1,\ 0 \le x_3 \le 1,\ x_1 + x_2 + x_3 \le 1 \\ 0 & \text{otherwise} \end{cases}
\]

then k is determined by

\[
\begin{aligned}
1 &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1,x_2,x_3)\,dx_3\,dx_2\,dx_1 \\
&= \int_0^1\left\{\int_0^{1-x_1}\left[\int_0^{1-x_1-x_2} k x_1 x_2 (1 - x_3)\,dx_3\right]dx_2\right\}dx_1
\end{aligned}
\]

This iterated integral has value k/144, so k = 144. The probability that rocks of
types 1 and 2 together account for at most 50% of the sample is


\[
\begin{aligned}
P(X_1 + X_2 \le .5) &= \iiint_{\{0 \le x_i \le 1 \text{ for } i = 1,2,3;\ x_1+x_2+x_3 \le 1;\ x_1+x_2 \le .5\}} f(x_1,x_2,x_3)\,dx_3\,dx_2\,dx_1 \\
&= \int_0^{.5}\left\{\int_0^{.5-x_1}\left[\int_0^{1-x_1-x_2} 144 x_1 x_2 (1 - x_3)\,dx_3\right]dx_2\right\}dx_1 \\
&= \int_0^{.5}\!\int_0^{.5-x_1} 72 x_1 x_2\left[1 - (x_1+x_2)^2\right]dx_2\,dx_1 = \frac{5}{32} = .1563
\end{aligned}
\]

■


The notion of independence of more than two random variables is similar to the
notion of independence of more than two events. Random variables X_1, X_2, ..., X_n
are said to be independent if for every subset X_{i_1}, X_{i_2}, ..., X_{i_k} of the variables (each
pair, each triple, and so on), the joint pmf or pdf of the subset is equal to the product
of the marginal pmfs or pdfs. Thus if the variables are independent with n = 4, then
the joint pmf or pdf of any two variables is the product of the two marginals, and
similarly for any three variables and all four variables together. Most important,
once we are told that n variables are independent, then the joint pmf or pdf is the
product of the n marginals.
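The two iterated integrals of Example 4.10 (the one that fixes k and the one for the event X_1 + X_2 ≤ .5) share the same nesting, so one helper can evaluate both. The sketch below is an independent numerical evaluation under stated assumptions: Simpson's rule replaces the hand integration, and the names `simpson`, `unnormalized_pdf`, and `nested_integral` are hypothetical.

```python
def simpson(f, a, b, n=20):
    """Composite Simpson's rule for the integral of f over [a, b] (n must be even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def unnormalized_pdf(x1, x2, x3):
    """The joint pdf of Example 4.10 without the constant k."""
    return x1 * x2 * (1 - x3)


def nested_integral(g, c):
    """Integrate g over 0 <= x1, 0 <= x2, x1 + x2 <= c, 0 <= x3 <= 1 - x1 - x2."""
    def over_x3(x1, x2):
        return simpson(lambda x3: g(x1, x2, x3), 0.0, 1.0 - x1 - x2)

    def over_x2(x1):
        return simpson(lambda x2: over_x3(x1, x2), 0.0, c - x1)

    return simpson(over_x2, 0.0, c)


k = 1 / nested_integral(unnormalized_pdf, 1.0)      # normalizing constant
p_half = k * nested_integral(unnormalized_pdf, 0.5)  # P(X1 + X2 <= .5)
print(f"k = {k:.2f}, P(X1 + X2 <= .5) = {p_half:.4f}")
```

With c = 1 the helper covers the whole region of positive density; with c = .5 it restricts x_1 + x_2 while leaving the x_3 limit unchanged, exactly as in the iterated integral of the example.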


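Example 4.11 If X_1, ..., X_n represent the lifetimes of n components, the
components operate independently of each other, and each lifetime is exponentially
distributed with parameter λ, then

\[
f(x_1,\ldots,x_n) = (\lambda e^{-\lambda x_1})\cdot(\lambda e^{-\lambda x_2})\cdots(\lambda e^{-\lambda x_n})
= \begin{cases} \lambda^n e^{-\lambda\sum x_i} & x_1 \ge 0,\ x_2 \ge 0,\ \ldots,\ x_n \ge 0 \\ 0 & \text{otherwise} \end{cases}
\]

If these n components are connected in series, so that the system will fail as soon
as a single component fails, then the probability that the system lasts past time t is

\[
\begin{aligned}
P(X_1 > t, \ldots, X_n > t) &= \int_t^{\infty}\cdots\int_t^{\infty} f(x_1,\ldots,x_n)\,dx_1\cdots dx_n \\
&= \left(\int_t^{\infty}\lambda e^{-\lambda x_1}\,dx_1\right)\cdots\left(\int_t^{\infty}\lambda e^{-\lambda x_n}\,dx_n\right) = \left(e^{-\lambda t}\right)^n = e^{-n\lambda t}
\end{aligned}
\]

Therefore,

\[
P(\text{system lifetime} \le t) = 1 - e^{-n\lambda t} \quad \text{for } t \ge 0
\]

which shows that system lifetime has an exponential distribution with parameter nλ;
the expected value of system lifetime is 1/(nλ).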
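A small simulation offers an informal check of the series-system result. The sketch below is only illustrative: the values n = 5 and λ = 0.002 are hypothetical (they do not come from the example), as are the function name `simulate_series_lifetimes` and the fixed seed.

```python
import math
import random


def simulate_series_lifetimes(n_components, rate, n_trials, seed=0):
    """Simulate system lifetimes; a series system fails when its first component fails."""
    rng = random.Random(seed)
    return [
        min(rng.expovariate(rate) for _ in range(n_components))
        for _ in range(n_trials)
    ]


n_components, rate = 5, 0.002          # hypothetical values chosen for illustration
lifetimes = simulate_series_lifetimes(n_components, rate, 100_000)

mean_lifetime = sum(lifetimes) / len(lifetimes)
t = 150.0
frac_surviving = sum(x > t for x in lifetimes) / len(lifetimes)

print(f"simulated mean {mean_lifetime:.1f} vs 1/(n*lambda) = {1 / (n_components * rate):.1f}")
print(f"simulated P(T > {t:g}) {frac_surviving:.3f} vs exp(-n*lambda*t) = "
      f"{math.exp(-n_components * rate * t):.3f}")
```

The simulated mean and survival fraction should land close to 1/(nλ) and e^{−nλt}, with discrepancies on the order of the Monte Carlo error.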

A variation on the foregoing scenario appeared in the article “A Method for 
Correlating Field Life Degradation with Reliability Prediction for Electronic 
Modules” (Quality and Reliability Engr. Intl., 2005: 715-726). The investigators 
considered a circuit card with n soldered chip resistors. The failure time of a card is
the minimum of the individual solder connection failure times (mileages here). It
was assumed that the solder connection failure mileages were independent, that
failure mileage would exceed t if and only if the shear strength of a connection
exceeded a threshold d, and that each shear strength was normally distributed with a
mean value and standard deviation that depended on the value of mileage t: μ(t) =
a_1 − a_2 t and σ(t) = a_3 + a_4 t (a weld's shear strength typically deteriorates and
becomes more variable as mileage increases). Then the probability that the failure
mileage of a card exceeds t is

\[
\left\{1 - \Phi\!\left(\frac{d - (a_1 - a_2 t)}{a_3 + a_4 t}\right)\right\}^{n}
\]

The cited article suggested values for d and the a_i's based on data. In contrast to
the exponential scenario, normality of individual lifetimes does not imply normality
of system lifetime. ■


Example 4.11 gives you a taste of the sub-field of probability called reliability, 
the study of how long devices and/or systems operate; see Exercises 16 and 17 as 
well. We will explore reliability in great depth in Sect. 4.8. 


4.1.5 Exercises: Section 4.1 (1-22) 


1. A service station has both self-service and full-service islands. On each island, 
there is a single regular unleaded pump with two hoses. Let X denote the 
number of hoses being used on the self-service island at a particular time, 
and let Y denote the number of hoses on the full-service island in use at that 
time. The joint pmf of X and Y appears in the accompanying table. 


                    y
  p(x, y)      0      1      2
      0      .10    .04    .02
  x   1      .08    .20    .06
      2      .06    .14    .30


(a) What is P(X = 1 and Y = 1)?
(b) Compute P(X ≤ 1 and Y ≤ 1).
(c) Give a word description of the event {X ≠ 0 and Y ≠ 0}, and compute the
probability of this event.

(d) Compute the marginal pmf of X and of Y. Using p_X(x), what is P(X ≤ 1)?

(e) Are X and Y independent rvs? Explain.

2. A large but sparsely populated county has two small hospitals, one at the south
end of the county and the other at the north end. The south hospital's emergency
room has 4 beds, whereas the north hospital's emergency room has only
3 beds. Let X denote the number of south beds occupied at a particular time on a
given day, and let Y denote the number of north beds occupied at the same time
on the same day. Suppose that these two rvs are independent, that the pmf of
X puts probability masses .1, .2, .3, .3, and .2 on the x values 0, 1, 2, 3, and
4, respectively, and that the pmf of Y distributes probabilities .1, .3, .4, and .2 on
the y values 0, 1, 2, and 3, respectively.

(a) Display the joint pmf of X and Y in a joint probability table.

(b) Compute P(X ≤ 1 and Y ≤ 1) by adding probabilities from the joint pmf,
and verify that this equals the product of P(X ≤ 1) and P(Y ≤ 1).

(c) Express the event that the total number of beds occupied at the two 
hospitals combined is at most 1 in terms of X and Y, and then calculate 
this probability. 

(d) What is the probability that at least one of the two hospitals has no beds 
occupied? 

3. A market has both an express checkout line and a superexpress checkout line.
Let X_1 denote the number of customers in line at the express checkout at a
particular time of day, and let X_2 denote the number of customers in line at the
superexpress checkout at the same time. Suppose the joint pmf of X_1 and X_2 is
as given in the accompanying table.

[Table: joint pmf p(x_1, x_2); entries not recoverable]

(a) What is P(X_1 = 1, X_2 = 1), that is, the probability that there is exactly one
customer in each line?

(b) What is P(X_1 = X_2), that is, the probability that the numbers of customers in
the two lines are identical?

(c) Let A denote the event that there are at least two more customers in one line
than in the other line. Express A in terms of X_1 and X_2, and calculate the
probability of this event.

(d) What is the probability that the total number of customers in the two lines is
exactly four? At least four?

(e) Determine the marginal pmf of X_1, and then calculate the expected number
of customers in line at the express checkout.

(f) Determine the marginal pmf of X_2.

(g) By inspection of the probabilities P(X_1 = 4), P(X_2 = 0), and
P(X_1 = 4, X_2 = 0), are X_1 and X_2 independent random variables? Explain.


4. Suppose 51% of the individuals in a certain population have brown eyes, 32%
have blue eyes, and the remainder have green eyes. Consider a random sample
of 10 people from this population.

(a) What is the probability that 5 of the 10 people have brown eyes, 3 of
10 have blue eyes, and the other 2 have green eyes?

(b) What is the probability that exactly one person in the sample has blue eyes
and exactly one has green eyes?

(c) What is the probability that at least 7 of the 10 people have brown eyes?
[Hint: Think of brown as a success and all other eye colors as failures.]

5. At a certain university, 20% of all students are freshmen, 18% are sophomores,
21% are juniors, and 41% are seniors. As part of a promotion, the university
bookstore is running a raffle for which all students are eligible. Ten students
will be randomly selected to receive prizes (in the form of textbooks for the
term).

(a) What is the probability the winners consist of two freshmen, two
sophomores, two juniors, and four seniors?

(b) What is the probability the winners are split equally among underclassmen
(freshmen and sophomores) and upperclassmen (juniors and seniors)?

(c) The raffle resulted in no freshmen being selected. The freshman class
president complained that something must be amiss for this to occur. Do
you agree? Explain.

6. According to the Mars Candy Company, the long-run percentages of various
colors of M&M’s milk chocolate candies are as follows:

Blue: 24% Orange: 20% Green: 16% Yellow: 14% Red: 13% Brown: 13%

(a) In a random sample of 12 candies, what is the probability that there are
exactly two of each color?

(b) In a random sample of 6 candies, what is the probability that at least one
color is not included?

(c) In a random sample of 10 candies, what is the probability that there are
exactly 3 blue candies and exactly 2 orange candies?

(d) In a random sample of 10 candies, what is the probability that there are at
most 3 orange candies? [Hint: Think of an orange candy as a success and
any other color as a failure.]

(e) In a random sample of 10 candies, what is the probability that at least 7 are
either blue, orange, or green?


7. The number of customers waiting for gift-wrap service at a department store is
an rv X with possible values 0, 1, 2, 3, 4 and corresponding probabilities .1, .2,
.3, .25, .15. A randomly selected customer will have 1, 2, or 3 packages for
wrapping with probabilities .6, .3, and .1, respectively. Let Y = the total number
of packages to be wrapped for the customers waiting in line (assume that the
number of packages submitted by one customer is independent of the number
submitted by any other customer).

(a) Determine P(X = 3, Y = 3), that is, p(3, 3).

(b) Determine p(4, 11).

8. Let X denote the number of Canon digital cameras sold during a particular
week by a certain store. The pmf of X is

  x          0      1      2      3      4
  p_X(x)    .1     .2     .3     .25    .15

Sixty percent of all customers who purchase these cameras also buy an
extended warranty. Let Y denote the number of purchasers during this week
who buy an extended warranty.

(a) What is P(X = 4, Y = 2)? [Hint: This probability equals P(Y = 2 | X = 4) ·
P(X = 4); now think of the four purchases as four trials of a binomial
experiment, with success on a trial corresponding to buying an extended
warranty.]

(b) Calculate P(X = Y).

(c) Determine the joint pmf of X and Y and then the marginal pmf of Y.


9. The joint probability distribution of the number X of cars and the number Y of
buses per signal cycle at a proposed left-turn lane is displayed in the
accompanying joint probability table.

                     y
  p(x, y)      0       1       2
      0      .025    .015    .010
      1      .050    .030    .020
      2      .125    .075    .050
  x   3      .150    .090    .060
      4      .100    .060    .040
      5      .050    .030    .020

(a) What is the probability that there is exactly one car and exactly one bus
during a cycle?

(b) What is the probability that there is at most one car and at most one bus
during a cycle?

(c) What is the probability that there is exactly one car during a cycle? Exactly
one bus?

(d) Suppose the left-turn lane is to have a capacity of five cars, and one bus is
equivalent to three cars. What is the probability of an overflow during a
cycle?

(e) Are X and Y independent rvs? Explain.

10. A stockroom currently has 30 components of a certain type, of which 8 were
provided by supplier 1, 10 by supplier 2, and 12 by supplier 3. Six of these are
to be randomly selected for a particular assembly. Let X = the number of
supplier 1's components selected, Y = the number of supplier 2's components
selected, and p(x, y) denote the joint pmf of X and Y.

(a) What is p(3, 2)? [Hint: Each sample of size 6 is equally likely to be
selected. Therefore, p(3, 2) = (number of outcomes with X = 3 and
Y = 2)/(total number of outcomes). Now use the product rule for counting
to obtain the numerator and denominator.]

(b) Using the logic of part (a), obtain p(x, y). (This can be thought of as a
multivariate hypergeometric distribution: sampling without replacement
from a finite population consisting of more than two categories.)

11. Each front tire of a vehicle is supposed to be filled to a pressure of 26 psi.
Suppose the actual air pressure in each tire is a random variable, X for the right
tire and Y for the left tire, with joint pdf

\[
f(x,y) = \begin{cases} k(x^2 + y^2) & 20 \le x \le 30,\ 20 \le y \le 30 \\ 0 & \text{otherwise} \end{cases}
\]

(a) What is the value of k?

(b) What is the probability that both tires are underfilled?

(c) What is the probability that the difference in air pressure between the two
tires is at most 2 psi?

(d) Determine the (marginal) distribution of air pressure in the right tire alone.

(e) Are X and Y independent rvs?

12. Annie and Alvie have agreed to meet between 5:00 and 6:00 p.m. for dinner at a
local health-food restaurant. Let X = Annie's arrival time and Y = Alvie's
arrival time. Suppose X and Y are independent with each uniformly distributed
on the interval [5, 6].

(a) What is the joint pdf of X and Y?

(b) What is the probability that they both arrive between 5:15 and 5:45?

(c) If the first one to arrive will wait only 10 min before leaving to eat
elsewhere, what is the probability that they have dinner at the health-food
restaurant? [Hint: The event of interest is A = {(x, y): |x − y| ≤ 1/6}.]

13. Two different professors have just submitted final exams for duplication. Let
X denote the number of typographical errors on the first professor's exam and
Y denote the number of such errors on the second exam. Suppose X has a
Poisson distribution with parameter μ_1, Y has a Poisson distribution with
parameter μ_2, and X and Y are independent.

(a) What is the joint pmf of X and Y?

(b) What is the probability that at most one error is made on both exams
combined?

(c) Obtain a general expression for the probability that the total number of
errors in the two exams is m (where m is a nonnegative integer). [Hint: A =
{(x, y): x + y = m} = {(m, 0), (m − 1, 1), ..., (1, m − 1), (0, m)}. Now sum
the joint pmf over (x, y) ∈ A and use the binomial theorem, which says that

\[
\sum_{k=0}^{m}\binom{m}{k} a^k b^{m-k} = (a+b)^m
\]

for any a, b.]
14. Two components of a computer have the following joint pdf for their useful
lifetimes X and Y:

\[
f(x,y) = \begin{cases} x e^{-x(1+y)} & x \ge 0 \text{ and } y \ge 0 \\ 0 & \text{otherwise} \end{cases}
\]

(a) What is the probability that the lifetime X of the first component exceeds 3?

(b) What are the marginal pdfs of X and Y? Are the two lifetimes independent?
Explain.

(c) What is the probability that the lifetime of at least one component
exceeds 3?

15. You have two lightbulbs for a particular lamp. Let X = the lifetime of the first
bulb and Y = the lifetime of the second bulb (both in thousands of hours).
Suppose that X and Y are independent and that each has an exponential
distribution with parameter λ = 1.

(a) What is the joint pdf of X and Y?

(b) What is the probability that each bulb lasts at most 1000 h (i.e., X ≤ 1 and
Y ≤ 1)?

(c) What is the probability that the total lifetime of the two bulbs is at most 2?
[Hint: Draw a picture of the region A = {(x, y): x ≥ 0, y ≥ 0, x + y ≤ 2}
before integrating.]

(d) What is the probability that the total lifetime is between 1 and 2?

16. Suppose that you have ten lightbulbs, that the lifetime of each is independent of
all the other lifetimes, and that each lifetime has an exponential distribution
with parameter λ.

(a) What is the probability that all ten bulbs fail before time t?

(b) What is the probability that exactly k of the ten bulbs fail before time t?

(c) Suppose that nine of the bulbs have lifetimes that are exponentially
distributed with parameter λ and that the remaining bulb has a lifetime
that is exponentially distributed with parameter θ (it is made by another
manufacturer). What is the probability that exactly five of the ten bulbs fail
before time t?

17. Consider a system consisting of three components as pictured. The system will
continue to function as long as the first component functions and either
component 2 or component 3 functions. Let X_1, X_2, and X_3 denote the lifetimes
of components 1, 2, and 3, respectively. Suppose the X_i's are independent of
each other and each X_i has an exponential distribution with parameter λ.

[Diagram: component 1 connected in series with the parallel pair of components 2 and 3]

(a) Let Y denote the system lifetime. Obtain the cumulative distribution function
of Y and differentiate to obtain the pdf. [Hint: F(y) = P(Y ≤ y); express
the event {Y ≤ y} in terms of unions and/or intersections of the three events
{X_1 ≤ y}, {X_2 ≤ y}, and {X_3 ≤ y}.]

(b) Compute the expected system lifetime.

18. (a) For f(x_1, x_2, x_3) as given in Example 4.10, compute the joint marginal
density function of X_1 and X_3 alone (by integrating over x_2).

(b) What is the probability that rocks of types 1 and 3 together make up at most
50% of the sample? [Hint: Use the result of part (a).]

(c) Compute the marginal pdf of X_1 alone. [Hint: Use the result of part (a).]

19. An ecologist selects a point inside a circular sampling region according to a
uniform distribution. Let X = the x coordinate of the point selected and Y = the
y coordinate of the point selected. If the circle is centered at (0, 0) and has
radius r, then the joint pdf of X and Y is

\[
f(x,y) = \begin{cases} \dfrac{1}{\pi r^2} & x^2 + y^2 \le r^2 \\ 0 & \text{otherwise} \end{cases}
\]

(a) What is the probability that the selected point is within r/2 of the center of
the circular region? [Hint: Draw a picture of the region of positive density
D. Because f(x, y) is constant on D, computing a probability reduces to
computing an area.]

(b) What is the probability that both X and Y differ from 0 by at most r/2?

(c) Answer part (b) for r/√2 replacing r/2.

(d) What is the marginal pdf of X? Of Y? Are X and Y independent?

20. Each customer making a particular Internet purchase must pay with one of
three types of credit cards (think Visa, MasterCard, AmEx). Let A_i (i = 1, 2, 3)
be the event that a type i credit card is used, with P(A_1) = .5, P(A_2) = .3,
P(A_3) = .2. Suppose that the number of customers who make a purchase on a
given day, N, is a Poisson rv with parameter μ. Define rvs X_1, X_2, X_3 by X_i = the
number among the N customers who use a type i card (i = 1, 2, 3). Show that
these three rvs are independent with Poisson distributions having parameters
.5μ, .3μ, and .2μ, respectively. [Hint: For non-negative integers x_1, x_2, x_3, let
n = x_1 + x_2 + x_3. Then P(X_1 = x_1, X_2 = x_2, X_3 = x_3) = P(X_1 = x_1, X_2 = x_2, X_3 = x_3,
N = n). Now condition on N = n, in which case the three X_i's have a trinomial
distribution (multinomial with 3 categories) with category probabilities .5, .3,
and .2.]

21. Consider randomly selecting two points A and B on the circumference of a 
circle by selecting their angles of rotation, in degrees, independently from a 
uniform distribution on the interval [0, 360]. Connect points A and B with a 
straight line segment. What is the probability that this random chord is longer 
than the side of an equilateral triangle inscribed inside the circle? 

(This is called Bertrand’s Chord Problem in the probability literature. There 
are other ways of randomly selecting a chord that give different answers from 
the one appropriate here.) [Hint: Place one of the vertices of the inscribed 
triangle at A. You should then be able to intuit the answer visually without 
having to do any integration. ] 

22. Section 3.8 introduced the accept-reject method for simulating continuous rvs.
Refer back to that algorithm in order to answer the questions below.

(a) Show that the probability a candidate value is “accepted” equals 1/c. [Hint:
According to the algorithm, this occurs iff U ≤ f(Y)/cg(Y), where
U ~ Uniform[0, 1) and Y ~ g. Compute the relevant double integral.]

(b) Argue that the average number of candidates required to generate a single
accepted value is c.

(c) Show that the accept-reject method does result in an observation from
the pdf f by showing that P(accepted value ≤ x) = F(x), where F is the
cdf corresponding to f. [Hint: Let X denote the accepted value. Then
P(X ≤ x) = P(Y ≤ x | Y accepted) = P(Y ≤ x ∩ Y accepted)/P(Y accepted).]


4.2 Expected Values, Covariance, and Correlation


We previously saw that any function h(X) of a single rv X is itself a random
variable. However, to compute E[h(X)], it was not necessary to obtain the probability
distribution of h(X); instead, E[h(X)] was computed as a weighted average of
h(X) values, where the weight function was the pmf p(x) or pdf f(x) of X. A similar
result holds for a function h(X, Y) of two jointly distributed random variables.


PROPOSITION

Let X and Y be jointly distributed rvs with pmf p(x, y) or pdf f(x, y) according
to whether the variables are discrete or continuous. Then the expected value
of a function h(X, Y), denoted by E[h(X, Y)] or μ_{h(X,Y)}, is given by

\[
E[h(X,Y)] = \begin{cases} \displaystyle\sum_x\sum_y h(x,y)\cdot p(x,y) & \text{if } X \text{ and } Y \text{ are discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x,y)\cdot f(x,y)\,dx\,dy & \text{if } X \text{ and } Y \text{ are continuous} \end{cases}
\tag{4.2}
\]

The method of computing the expected value of a function h(X_1, ..., X_n) of
n random variables is similar to Eq. (4.2). If the X_i's are discrete, E[h(X_1, ..., X_n)] is
an n-dimensional sum; if the X_i's are continuous, it is an n-dimensional integral.

Example 4.12 Five friends have purchased tickets to a concert. If the tickets are for
seats 1-5 in a particular row and the tickets are randomly distributed among the
five, what is the expected number of seats separating any particular two of the five
friends? Let X and Y denote the seat numbers of the first and second individuals,
respectively. Possible (X, Y) pairs are {(1, 2), (1, 3), ..., (5, 4)}, and the joint pmf of
(X, Y) is

\[
p(x,y) = \begin{cases} \dfrac{1}{20} & x = 1, \ldots, 5;\ y = 1, \ldots, 5;\ x \ne y \\ 0 & \text{otherwise} \end{cases}
\]

The number of seats separating the two individuals is h(X, Y) = |X − Y| − 1. The
accompanying table gives h(x, y) for each possible (x, y) pair.

                     x
  h(x, y)     1     2     3     4     5
      1       -     0     1     2     3
      2       0     -     0     1     2
  y   3       1     0     -     0     1
      4       2     1     0     -     0
      5       3     2     1     0     -

Thus

\[
E[h(X,Y)] = \sum\sum_{(x,y)} h(x,y)\cdot p(x,y) = \sum_{x=1}^{5}\ \sum_{\substack{y=1 \\ y \ne x}}^{5} (|x-y| - 1)\cdot\frac{1}{20} = 1
\]

■

Example 4.13 In Example 4.5, the joint pdf of the amount X of almonds and
amount Y of cashews in a 1-lb can of nuts was

\[
f(x,y) =
\begin{cases}
24xy & 0 \le x \le 1,\ 0 \le y \le 1,\ x + y \le 1 \\
0 & \text{otherwise}
\end{cases}
\]

If 1 lb of almonds costs the company $6.00, 1 lb of cashews costs $10.00, and
1 lb of peanuts costs $3.50, then the total cost of the contents of a can is

h(X, Y) = 6X + 10Y + 3.5(1 − X − Y) = 3.5 + 2.5X + 6.5Y

(since 1 − X − Y of the weight consists of peanuts). The expected total cost is

\[
E[h(X,Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x,y)\, f(x,y)\, dx\, dy
= \int_{0}^{1}\int_{0}^{1-x} (3.5 + 2.5x + 6.5y) \cdot 24xy \, dy\, dx = \$7.10
\]
■
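Both forms of Eq. (4.2) can be checked numerically. The sketch below is only an illustration, not part of the original text: it evaluates the double sum of Example 4.12 exactly and approximates the double integral of Example 4.13 with a simple nested midpoint rule. The function names, the grid size n = 200, and the choice of the midpoint rule are assumptions made for this example.

```python
from fractions import Fraction

# Discrete case (Example 4.12): each ordered pair of distinct seats has pmf 1/20.
seats = range(1, 6)
seat_pmf = {(x, y): Fraction(1, 20) for x in seats for y in seats if x != y}


def expected_discrete(h, pmf):
    """E[h(X, Y)] as the double sum in Eq. (4.2)."""
    return sum(h(x, y) * p for (x, y), p in pmf.items())


def expected_continuous(h, n=200):
    """Approximate the double integral in Eq. (4.2) for f(x, y) = 24xy.

    Nested midpoint rule: x runs over [0, 1] and, for each x, y runs over
    [0, 1 - x], so only the triangular support of the density is visited.
    """
    total = 0.0
    dx = 1.0 / n
    for i in range(n):
        x = (i + 0.5) * dx
        dy = (1.0 - x) / n
        for j in range(n):
            y = (j + 0.5) * dy
            total += h(x, y) * 24 * x * y * dy * dx
    return total


seats_between = expected_discrete(lambda x, y: abs(x - y) - 1, seat_pmf)
total_mass = expected_continuous(lambda x, y: 1.0)
expected_cost = expected_continuous(lambda x, y: 3.5 + 2.5 * x + 6.5 * y)

print(seats_between)            # expected number of seats between the two friends
print(round(total_mass, 4))     # sanity check: the density should integrate to about 1
print(round(expected_cost, 2))  # compare with the $7.10 reported in Example 4.13
```

Exact fractions are used in the discrete case so that the sum involves no rounding; the continuous result is only an approximation whose accuracy depends on the grid size.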


4.2.1 Properties of Expected Value 


In Chaps. 2 and 3, we saw that expected values can be distributed across addition,
subtraction, and multiplication by constants. In the language of mathematics,
expected value is a linear operator. This was a simple consequence of expectation
being a sum or an integral, both of which are linear. This obvious but important
property, linearity of expectation, extends to more than one variable.


LINEARITY OF EXPECTATION
Let X and Y be random variables. Then, for any functions h_1, h_2 and any
constants a_1, a_2, b,

\[
E[a_1 h_1(X,Y) + a_2 h_2(X,Y) + b] = a_1 E[h_1(X,Y)] + a_2 E[h_2(X,Y)] + b
\]

In the previous example, E(3.5 + 2.5X + 6.5Y) can be rewritten as 3.5 + 2.5E(X) +
6.5E(Y); the means of X and Y can be computed either by using Eq. (4.2) or by first
finding the marginal pdfs of X and Y and then computing the appropriate single
integrals.
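As a quick check of the second route, one can use the marginal pdf f_X(x) = 12x(1 − x)^2 on [0, 1] that the text quotes for this joint density in Example 4.15 (f_Y has the same form by symmetry). The worked computation below is a sketch under that assumption:

```latex
\begin{aligned}
E(X) &= \int_0^1 x \cdot 12x(1-x)^2 \, dx
      = 12\int_0^1 \left(x^2 - 2x^3 + x^4\right) dx
      = 12\left(\tfrac{1}{3} - \tfrac{1}{2} + \tfrac{1}{5}\right)
      = \tfrac{2}{5}, \qquad E(Y) = \tfrac{2}{5} \\[1ex]
E(3.5 + 2.5X + 6.5Y) &= 3.5 + 2.5\left(\tfrac{2}{5}\right) + 6.5\left(\tfrac{2}{5}\right)
      = 3.5 + 1 + 2.6 = 7.10
\end{aligned}
```

This agrees with the $7.10 obtained from the double integral in Example 4.13.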

As another illustration, linearity of expectation tells us that for any two rvs 
X and Y, 


\[
E(5XY^2 - 4XY + e^X + 12) = 5E(XY^2) - 4E(XY) + E(e^X) + 12 \tag{4.3}
\]


In general, we cannot distribute the expected value operation any further. But when 
h(X, Y) is a product of a function of X and a function of Y, the expected value 
simplifies in the case of independence. 


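THEOREM
Let X and Y be independent random variables. If h(X, Y) = g_1(X) · g_2(Y), then

\[
E[h(X,Y)] = E[g_1(X) \cdot g_2(Y)] = E[g_1(X)] \cdot E[g_2(Y)]
\]

Proof We present the proof here for two continuous rvs; the discrete case is
similar. Apply Eq. (4.2):

\[
\begin{aligned}
E[h(X,Y)] = E[g_1(X) \cdot g_2(Y)]
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_1(x)\, g_2(y)\, f(x,y)\, dx\, dy && \text{by (4.2)} \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g_1(x)\, g_2(y)\, f_X(x)\, f_Y(y)\, dx\, dy && \text{because } X \text{ and } Y \text{ are independent} \\
&= \left(\int_{-\infty}^{\infty} g_1(x)\, f_X(x)\, dx\right)\left(\int_{-\infty}^{\infty} g_2(y)\, f_Y(y)\, dy\right)
 = E[g_1(X)] \cdot E[g_2(Y)]
\end{aligned}
\]
■

So, if X and Y are independent, Eq. (4.3) simplifies further, to 5E(X)E(Y²) −
4E(X)E(Y) + E(e^X) + 12. Not surprisingly, both linearity of expectation and the
foregoing corollary can be extended to more than two random variables.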
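A small exact computation can make the product rule concrete. In the sketch below the two marginal pmfs are made up purely for illustration (they do not come from the text); the joint pmf is built as their product, which is what independence means, and the simplified form of Eq. (4.3) is then compared with a direct evaluation via Eq. (4.2).

```python
from fractions import Fraction
from math import exp

# Hypothetical marginal pmfs, chosen only for illustration.
p_x = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}
p_y = {1: Fraction(1, 4), 3: Fraction(3, 4)}

# Independence: the joint pmf is the product of the marginal pmfs.
joint = {(x, y): px * py for x, px in p_x.items() for y, py in p_y.items()}


def expect(h, pmf):
    """E[h(X, Y)] via the double sum of Eq. (4.2)."""
    return sum(h(x, y) * p for (x, y), p in pmf.items())


e_x = sum(x * p for x, p in p_x.items())
e_y = sum(y * p for y, p in p_y.items())
e_y2 = sum(y ** 2 * p for y, p in p_y.items())
e_exp_x = sum(exp(x) * p for x, p in p_x.items())  # float, since exp returns a float

# Product rule: E[X * Y^2] should equal E[X] * E[Y^2] under independence.
lhs = expect(lambda x, y: x * y ** 2, joint)
rhs = e_x * e_y2

# Eq. (4.3) evaluated directly and in its simplified form.
direct = expect(lambda x, y: 5 * x * y ** 2 - 4 * x * y + exp(x) + 12, joint)
simplified = 5 * e_x * e_y2 - 4 * e_x * e_y + e_exp_x + 12

print(lhs, rhs)
print(abs(direct - simplified) < 1e-9)
```

If the joint pmf were replaced by one that is not a product of its marginals, the first comparison would generally fail, which is the point of the independence assumption.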


4.2.2 Covariance


When two random variables X and Y are not independent, it is frequently of interest
to assess how strongly they are related to each other.

DEFINITION
The covariance between two rvs X and Y is

\[
\mathrm{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)] =
\begin{cases}
\displaystyle\sum_{x}\sum_{y} (x - \mu_X)(y - \mu_Y)\, p(x,y) & \text{if } X \text{ and } Y \text{ are discrete} \\[2ex]
\displaystyle\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, f(x,y)\, dx\, dy & \text{if } X \text{ and } Y \text{ are continuous}
\end{cases}
\]

The rationale for the definition is as follows. Suppose X and Y have a strong
positive relationship to each other, by which we mean that large values of X tend to
occur with large values of Y and small values of X with small values of Y (e.g.,
X = height and Y = weight). Then most of the probability mass or density will be
associated with (x − μ_X) and (y − μ_Y) either both positive (both X and Y above their
respective means) or both negative, so the product (x − μ_X)(y − μ_Y) will tend to be
positive. Thus for a strong positive relationship, Cov(X, Y) should be quite positive.
For a strong negative relationship, the signs of (x − μ_X) and (y − μ_Y) will tend to be
opposite, resulting in a negative product. Thus for a strong negative relationship,
Cov(X, Y) should be quite negative. If X and Y are not strongly related, positive and
negative products will tend to cancel each other, yielding a covariance near
0. Figure 4.4 illustrates the different possibilities. The covariance depends on
both the set of possible pairs and the probabilities. In Fig. 4.4, the probabilities
could be changed without altering the set of possible pairs, and this could drastically
change the value of Cov(X, Y).


[Figure 4.4: three scatter panels (a), (b), (c), each with x and y axes and reference lines at μ_X and μ_Y]

Fig. 4.4 p(x, y) = .10 for each of ten pairs corresponding to indicated points; (a) positive
covariance; (b) negative covariance; (c) covariance near zero


Example 4.14 The joint and marginal pmfs for X = automobile policy deductible
amount and Y = homeowner policy deductible amount in Example 4.1 were

                     y
p(x, y) |    0    100    200
--------+--------------------
    100 |  .20    .10    .20
x   250 |  .05    .15    .30

x      |  100   250          y      |    0    100    200
p_X(x) |   .5    .5          p_Y(y) |  .25    .25    .50

from which μ_X = Σ x · p_X(x) = 175 and μ_Y = 125. Therefore,

\[
\begin{aligned}
\mathrm{Cov}(X,Y) &= \sum_{(x,y)} (x - 175)(y - 125)\, p(x,y) \\
&= (100 - 175)(0 - 125)(.20) + \cdots + (250 - 175)(200 - 125)(.30) = 1875
\end{aligned}
\]
■
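The six-term sum in Example 4.14 is easy to reproduce by machine. The sketch below is an illustration only (the helper name `expect` is an assumption); it stores the joint pmf as exact fractions, evaluates the covariance from the definition, and cross-checks it against the shortcut E(XY) − μ_X · μ_Y listed as property 3 in the proposition on covariance.

```python
from fractions import Fraction as F

# Joint pmf of Example 4.14, keyed by (x, y) = (auto deductible, homeowner deductible).
joint = {
    (100, 0): F(20, 100), (100, 100): F(10, 100), (100, 200): F(20, 100),
    (250, 0): F(5, 100),  (250, 100): F(15, 100), (250, 200): F(30, 100),
}


def expect(h, pmf):
    """E[h(X, Y)] via the double sum of Eq. (4.2)."""
    return sum(h(x, y) * p for (x, y), p in pmf.items())


mu_x = expect(lambda x, y: x, joint)
mu_y = expect(lambda x, y: y, joint)

# Definition: expected product of deviations from the means.
cov_definition = expect(lambda x, y: (x - mu_x) * (y - mu_y), joint)

# Shortcut: E(XY) minus the product of the means.
cov_shortcut = expect(lambda x, y: x * y, joint) - mu_x * mu_y

print(mu_x, mu_y, cov_definition, cov_shortcut)
```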


The following proposition summarizes some important properties of covariance. 


PROPOSITION 

For any two random variables X and Y, 

1. Cov(X, Y) = Cov(Y, X) 

2. Cov(X, X) = Var(X) 

3. (Covariance shortcut formula) Cov(X, Y) = E(XY) − μ_X · μ_Y

4. (Distributive property of covariance) For any rv Z and any constants, a, b, c, 


Cov(aX + bY + c,Z) = aCov(X, Z) + bCov(Y, Z) 


Proof Property 1 is obvious from the definition of covariance. To establish
property 2, replace Y with X in the definition:

\[
\mathrm{Cov}(X,X) = E[(X - \mu_X)(X - \mu_X)] = E[(X - \mu_X)^2] = \mathrm{Var}(X)
\]

To prove property 3, apply linearity of expectation:

\[
\begin{aligned}
\mathrm{Cov}(X,Y) &= E[(X - \mu_X)(Y - \mu_Y)] \\
&= E(XY - \mu_X Y - \mu_Y X + \mu_X \mu_Y) \\
&= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X \mu_Y \\
&= E(XY) - \mu_X \mu_Y - \mu_X \mu_Y + \mu_X \mu_Y = E(XY) - \mu_X \mu_Y
\end{aligned}
\]

Property 4 also follows from linearity of expectation (Exercise 39). ■

According to property 3, the covariance shortcut, no intermediate subtractions
are necessary to calculate covariance; only at the end of the computation is μ_X · μ_Y
subtracted from E(XY).


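Example 4.15 (Example 4.5 continued) The joint and marginal pdfs of
X = amount of almonds and Y = amount of cashews were

\[
f(x,y) =
\begin{cases}
24xy & 0 \le x \le 1,\ 0 \le y \le 1,\ x + y \le 1 \\
0 & \text{otherwise}
\end{cases}
\qquad
f_X(x) =
\begin{cases}
12x(1-x)^2 & 0 \le x \le 1 \\
0 & \text{otherwise}
\end{cases}
\]

with f_Y(y) obtained by replacing x by y in f_X(x). It is easily verified that μ_X = μ_Y = 2/5, and

\[
E(XY) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f(x,y)\, dx\, dy
= \int_{0}^{1}\int_{0}^{1-x} xy \cdot 24xy \, dy\, dx
= 8\int_{0}^{1} x^2 (1-x)^3 \, dx = \frac{2}{15}
\]

Thus Cov(X, Y) = 2/15 − (2/5)(2/5) = 2/15 − 4/25 = −2/75. A negative covariance is
reasonable here because more almonds in the can implies fewer cashews. ■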


4.2.3 Correlation 


It would appear that the relationship in the insurance example is quite strong since
Cov(X, Y) = 1875, whereas in the nut example Cov(X, Y) = −2/75 would seem to
imply quite a weak relationship. Unfortunately, the covariance has a serious defect
that makes it impossible to interpret a computed value of the covariance. In the
insurance example, suppose we had expressed the deductible amount in cents
rather than in dollars. Then 100X would replace X, 100Y would replace Y, and the
resulting covariance would be Cov(100X, 100Y) = (100)(100)Cov(X, Y) = 18,750,000.
[To see why, apply properties 1 and 4 of the previous proposition.] If, on the other
hand, the deductible amounts had been expressed in hundreds of dollars, the
computed covariance would have been (.01)(.01)(1875) = .1875. The defect of
covariance is that its computed value depends critically on the units of measurement.
Ideally, the choice of units should have no effect on a measure of strength of
relationship. This is achieved by scaling the covariance.


DEFINITION
The correlation coefficient of X and Y, denoted by Corr(X, Y), or ρ_{X,Y}, or just
ρ, is defined by

\[
\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}
\]


Example 4.16 It is easily verified that in the insurance scenario of Example 4.14,
E(X²) = 36,250, σ_X² = 36,250 − (175)² = 5625, σ_X = 75, E(Y²) = 22,500, σ_Y² = 6875,
and σ_Y = 82.92. This gives

\[
\rho = \frac{1875}{(75)(82.92)} = .301
\]
■
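The effect of a change of units is easy to see numerically. The sketch below is illustrative only (the helper names `summary` and `corr` are assumptions); it recomputes the covariance and correlation for the insurance pmf of Example 4.14 in dollars and again in cents. The covariance is multiplied by 100 · 100, while the correlation is unaffected.

```python
from fractions import Fraction as F
from math import sqrt

# Joint pmf of Example 4.14 in dollars, keyed by (x, y).
dollars = {
    (100, 0): F(20, 100), (100, 100): F(10, 100), (100, 200): F(20, 100),
    (250, 0): F(5, 100),  (250, 100): F(15, 100), (250, 200): F(30, 100),
}


def summary(pmf):
    """Return (Cov(X, Y), Var(X), Var(Y)) computed exactly from a joint pmf."""
    mu_x = sum(x * p for (x, _), p in pmf.items())
    mu_y = sum(y * p for (_, y), p in pmf.items())
    var_x = sum((x - mu_x) ** 2 * p for (x, _), p in pmf.items())
    var_y = sum((y - mu_y) ** 2 * p for (_, y), p in pmf.items())
    cov = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in pmf.items())
    return cov, var_x, var_y


def corr(pmf):
    """Correlation coefficient: covariance divided by the product of the sds."""
    cov, var_x, var_y = summary(pmf)
    return float(cov) / sqrt(float(var_x) * float(var_y))


# Same distribution with both deductibles expressed in cents.
cents = {(100 * x, 100 * y): p for (x, y), p in dollars.items()}

cov_dollars, var_x, var_y = summary(dollars)
cov_cents = summary(cents)[0]
rho_dollars = corr(dollars)
rho_cents = corr(cents)

print(cov_dollars, cov_cents)
print(round(rho_dollars, 3), round(rho_cents, 3))
```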


The following proposition shows that ρ remedies the defect of Cov(X, Y) and
also suggests how to recognize the existence of a strong (linear) relationship.


PROPOSITION

For any two rvs X and Y,

1. Corr(X, Y) = Corr(Y, X)

2. Corr(X, X) = 1

3. (Scale invariance property) If a, b, c, d are constants and ac > 0,

   Corr(aX + b, cY + d) = Corr(X, Y)

4. −1 ≤ Corr(X, Y) ≤ 1


Proof Property 1 is clear from the definition of correlation and the corresponding
property of covariance. For Property 2, write Corr(X, X) = Cov(X, X)/[σ_X · σ_X] =
Var(X)/σ_X² = 1. The second-to-last step uses Property 2 of covariance. The proofs of
Properties 3 and 4 appear as exercises. ■

Property 3 (scale invariance) says precisely that the correlation coefficient is not
affected by a linear change in the units of measurement. If, say, Y = completion
time for a chemical reaction in seconds and X = temperature in °C, then Y/60 = time
in minutes and 1.8X + 32 = temperature in °F, but Corr(X, Y) will be exactly the
same as Corr(1.8X + 32, Y/60).

According to Properties 2 and 4, the strongest possible positive relationship is
evidenced by ρ = +1, whereas the strongest possible negative relationship
corresponds to ρ = −1. Therefore, the correlation coefficient provides information
about both the nature and strength of the relationship between X and Y: the sign of ρ
indicates whether X and Y are positively or negatively related, and the magnitude of
ρ describes the strength of that relationship on an absolute 0-1 scale.

While superior to covariance, the correlation coefficient ρ is actually not a
completely general measure of the strength of a relationship.


PROPOSITION

1. If X and Y are independent, then ρ = 0, but ρ = 0 does not imply
independence.

2. ρ = 1 or −1 iff Y = aX + b for some numbers a and b with a ≠ 0.


Exercise 38 and Example 4.17 relate to Statement 1, and Statement 2 is
investigated in Exercises 41 and 42(d).

This proposition says that ρ is a measure of the degree of linear relationship
between X and Y, and only when the two variables are perfectly related in a linear
manner will ρ be as positive or negative as it can be. A ρ less than 1 in absolute
value indicates only that the relationship is not completely linear, but there may still
be a very strong nonlinear relation. Also, ρ = 0 does not imply that X and Y are
independent, but only that there is complete absence of a linear relationship. When
ρ = 0, X and Y are said to be uncorrelated. Two variables could be uncorrelated yet
highly dependent because of a strong nonlinear relationship, so be careful not to
conclude too much from knowing that ρ = 0.


Example 4.17 Let X and Y be discrete rvs with joint pmf

\[
p(x,y) =
\begin{cases}
.25 & (x,y) = (-4, 1),\ (4, -1),\ (2, 2),\ (-2, -2) \\
0 & \text{otherwise}
\end{cases}
\]

The points that receive positive probability mass are identified on the (x, y)
coordinate system in Fig. 4.5. It is evident from the figure that the value of X is
completely determined by the value of Y and vice versa, so the two variables are
completely dependent. However, by symmetry μ_X = μ_Y = 0 and E(XY) = (−4)(.25)
+ (−4)(.25) + (4)(.25) + (4)(.25) = 0, so Cov(X, Y) = E(XY) − μ_X · μ_Y = 0 and thus
ρ_{X,Y} = 0. Although there is perfect dependence, there is also complete absence of
any linear relationship! ■

[Figure 4.5: the four points (−4, 1), (4, −1), (2, 2), (−2, −2) plotted on (x, y) axes]

Fig. 4.5 The population of pairs for Example 4.17


The next result provides an alternative view of zero correlation. 


PROPOSITION
Two rvs X and Y are uncorrelated if, and only if, E[XY] = μ_X · μ_Y.


Proof By its definition, Corr(X, Y) = 0 iff Cov(X, Y) = 0. Apply the covariance
shortcut formula:

\[
\rho = 0 \iff \mathrm{Cov}(X,Y) = 0 \iff E[XY] - \mu_X \cdot \mu_Y = 0 \iff E[XY] = \mu_X \cdot \mu_Y
\]
■

Contrast this with an earlier proposition from this section: if X and Y are
independent, then E[g_1(X)g_2(Y)] = E[g_1(X)] · E[g_2(Y)] for all functions g_1 and g_2.
Thus, independence is stronger than zero correlation, the latter just being the special
case corresponding to g_1(X) = X and g_2(Y) = Y.
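The four-point distribution of Example 4.17 gives a concrete instance of this contrast. In the sketch below (an illustration only; the particular choice g_1(x) = x², g_2(y) = y² is ours, not the text's), the product rule holds for the identity functions, so the variables are uncorrelated, but it fails for the squares, which would be impossible if X and Y were independent.

```python
from fractions import Fraction as F

# Joint pmf of Example 4.17: probability .25 on each of four points.
joint = {(-4, 1): F(1, 4), (4, -1): F(1, 4), (2, 2): F(1, 4), (-2, -2): F(1, 4)}


def expect(h, pmf):
    """E[h(X, Y)] via the double sum of Eq. (4.2)."""
    return sum(h(x, y) * p for (x, y), p in pmf.items())


mu_x = expect(lambda x, y: x, joint)
mu_y = expect(lambda x, y: y, joint)

# g1(x) = x, g2(y) = y: E[XY] equals mu_x * mu_y, so the covariance is zero.
cov = expect(lambda x, y: x * y, joint) - mu_x * mu_y

# g1(x) = x^2, g2(y) = y^2: the product rule fails, so X and Y are dependent.
e_x2y2 = expect(lambda x, y: x ** 2 * y ** 2, joint)
e_x2 = expect(lambda x, y: x ** 2, joint)
e_y2 = expect(lambda x, y: y ** 2, joint)

print(cov)                 # zero covariance, hence zero correlation
print(e_x2y2, e_x2 * e_y2) # unequal values reveal the dependence
```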


4.2.4 Correlation Versus Causation 


A value of ρ near 1 does not necessarily imply that increasing the value of X causes
Y to increase. It implies only that large X values are associated with large Y values.
For example, in the population of children, vocabulary size and number of cavities
are quite positively correlated, but it is certainly not true that cavities cause
vocabulary to grow. Instead, the values of both these variables tend to increase as
the value of age, a third variable, increases. For children of a fixed age, there is
probably a very low correlation between number of cavities and vocabulary size. In
summary, association (a high correlation) is not the same as causation.


4.2.5 Exercises: Section 4.2 (23-42)


23. The two most common types of errors made by programmers are syntax errors
and logic errors. Let X denote the number of syntax errors and Y the number of
logic errors on the first run of a program. Suppose X and Y have the following
joint pmf for a particular programming assignment:

                     x
p(x, y) |    0     1     2     3
--------+-------------------------
      0 |  .71   .03   .02   .01
y     1 |  .04   .06   .03   .01
      2 |  .03   .03   .02   .01

(a) What is the probability a program has more syntax errors than logic errors
on the first run?

(b) Find the marginal pmfs of X and Y.

(c) Are X and Y independent? How can you tell?

(d) What is the average number of syntax errors in the first run of a program?
What is the average number of logic errors?

(e) Suppose an evaluator assigns points to each program with the formula
100 − 4X − 9Y. What is the expected point score for a randomly selected
program?

24. An instructor has given a short quiz consisting of two parts. For a randomly
selected student, let X = the number of points earned on the first part and
Y = the number of points earned on the second part. Suppose that the joint pmf
of X and Y is given in the accompanying table.

                     y
p(x, y) |    0     5    10    15
--------+-------------------------
      0 |  .02   .06   .02   .10
x     5 |  .04   .15   .20   .10
     10 |  .01   .15   .14   .01

(a) If the score recorded in the grade book is the total number of points earned
on the two parts, what is the expected recorded score E(X + Y)?

(b) If the maximum of the two scores is recorded, what is the expected
recorded score?

25. The difference between the number of customers in line at the express
checkout and the number in line at the superexpress checkout in Exercise
3 is X_1 − X_2. Calculate the expected difference.

26. Six individuals, including A and B, take seats around a circular table
in a completely random fashion. Suppose the seats are numbered 1, ...,
6. Let X = A's seat number and Y = B's seat number. If A sends a written
message around the table to B in the direction in which they are closest,
how many individuals (including A and B) would you expect to handle
the message?

27. A surveyor wishes to lay out a square region with each side having length L.
However, because of measurement error, he instead lays out a rectangle in
which the north-south sides both have length X and the east-west sides both
have length Y. Suppose that X and Y are independent and that each is
uniformly distributed on the interval [L − A, L + A] (where 0 < A < L). What
is the expected area of the resulting rectangle?

28. Consider a small ferry that can accommodate cars and buses. The toll for cars
is $3, and the toll for buses is $10. Let X and Y denote the number of cars and
buses, respectively, carried on a single trip. Suppose the joint distribution of
X and Y is as given in the table of Exercise 9. Compute the expected revenue
from a single trip.

29. Annie and Alvie have agreed to meet for lunch between noon (0:00 p.m.) and
1:00 p.m. Denote Annie's arrival time by X, Alvie's by Y, and suppose X and
Y are independent with pdfs

\[
f_X(x) =
\begin{cases}
3x^2 & 0 \le x \le 1 \\
0 & \text{otherwise}
\end{cases}
\qquad
f_Y(y) =
\begin{cases}
2y & 0 \le y \le 1 \\
0 & \text{otherwise}
\end{cases}
\]

What is the expected amount of time that the one who arrives first must wait
for the other person? [Hint: h(X, Y) = |X − Y|.]

30. Suppose that X and Y are independent rvs with moment generating functions
M_X(t) and M_Y(t), respectively. If Z = X + Y, show that M_Z(t) = M_X(t) · M_Y(t).
[Hint: Use the proposition on the expected value of a product.]

31. Compute the correlation coefficient ρ for X and Y of Example 4.15 (the
covariance has already been computed).

32. (a) Compute the covariance for X and Y in Exercise 24.

(b) Compute ρ for X and Y in the same exercise.

33. (a) Compute the covariance between X and Y in Exercise 11.

(b) Compute the correlation coefficient ρ for this X and Y.


34. Reconsider the computer component lifetimes X and Y as described in Exercise
14. Determine E(XY). What can be said about Cov(X, Y) and ρ?

35. Refer back to Exercise 23.

(a) Calculate the covariance of X and Y.

(b) Calculate the correlation coefficient of X and Y. Interpret this value.

36. In practice, it is often desired to predict the value of a variable Y from the
known value of some other variable, X. For example, a doctor might wish to
predict the lifespan Y of someone who smokes X cigarettes a day, or an
engineer may require predictions of the tensile strength Y of steel made with
concentration X of a certain additive. A linear predictor of Y is anything of the
form Ŷ = a + bX; the "hat" ^ on Y indicates prediction.

A common measure of the quality of a predictor is given by the mean square
prediction error:

\[
E\left[(Y - \hat{Y})^2\right]
\]

(a) Show that the choices of a and b that minimize mean square prediction
error are

\[
b = \rho \cdot \frac{\sigma_Y}{\sigma_X} \qquad a = \mu_Y - b \cdot \mu_X
\]

where ρ = Corr(X, Y). The resulting expression for Ŷ is often called the best
linear predictor of Y, given X. [Hint: Expand the expression for mean
square prediction error, apply linearity of expectation, and then use
calculus.]

(b) Determine the mean square prediction error for the best linear predictor.
How does the value of ρ affect this quantity?

37. (a) Recalling the definition of σ² for a single rv X, write a formula that would
be appropriate for computing the variance of a function h(X, Y) of two
random variables. [Hint: Remember that variance is just a special expected
value.]

(b) Use this formula to compute the variance of the recorded score h(X, Y)
[= max(X, Y)] in part (b) of Exercise 24.

38. Show that when X and Y are independent, Cov(X, Y) = Corr(X, Y) = 0.

39. Use linearity of expectation to establish the covariance property

Cov(aX + bY + c, Z) = aCov(X, Z) + bCov(Y, Z)

40. (a) Use the properties of covariance to show that Cov(aX + b, cY + d) =
acCov(X, Y).

(b) Use part (a) along with the rescaling property of standard deviation to show
that Corr(aX + b, cY + d) = Corr(X, Y) when ac > 0 (this is the scale invariance
property of correlation).

(c) What happens if a and c have opposite signs, so ac < 0?

41. Show that if Y = aX + b (a ≠ 0), then Corr(X, Y) = +1 or −1. Under what
conditions will ρ = +1?

42. Let Z_X be the standardized X, Z_X = (X − μ_X)/σ_X, let Z_Y be the standardized Y,
Z_Y = (Y − μ_Y)/σ_Y, and let ρ = Corr(X, Y).

(a) Use properties of covariance and correlation to show that Corr(X, Y) =
Cov(Z_X, Z_Y) = E(Z_X Z_Y).

(b) Use the linearity of expectation along with part (a), to show that
E[(Z_Y − ρZ_X)²] = 1 − ρ². [Hint: If Z is a standardized rv, what are its
mean and variance, and how can you use those to determine E(Z²)?]

(c) Use part (b) to show that −1 ≤ ρ ≤ 1.

(d) Use part (b) to show that ρ = 1 implies that Y = aX + b where a > 0, and
ρ = −1 implies that Y = aX + b where a < 0.


4.3 Properties of Linear Combinations 


A linear combination of random variables refers to anything of the form
a_1X_1 + ... + a_nX_n + b, where the X_i's are random variables and the a_i's and b are numerical
constants. (Some sources do not include the constant b in the definition.) For
example, suppose your investment portfolio with a particular financial institution
includes 100 shares of stock #1, 200 shares of stock #2, and 500 shares of stock #3.
Let X_1, X_2, and X_3 denote the share prices of these three stocks at the end of the
current fiscal year. Suppose also that the financial institution will levy a management
fee of $150. Then the value of your investments with this institution at the end
of the year is 100X_1 + 200X_2 + 500X_3 − 150, which is a particular linear combination.
Important special cases include the total X_1 + ... + X_n (take a_1 = ... = a_n = 1,
b = 0), the difference of two rvs X_1 − X_2 (n = 2, a_1 = 1, a_2 = −1), and anything of the
form aX + b (take n = 1 or, equivalently, set a_2 = ... = a_n = 0). Another very important
linear combination is the sample mean (X_1 + ... + X_n)/n, conventionally
denoted X̄; just take a_1 = ... = a_n = 1/n and b = 0.

Notice that we are not requiring the X_i's to be independent or to have the same
probability distribution. All the X_i's could have different distributions and therefore
different mean values and standard deviations. In this section, we investigate the
general properties of linear combinations. Section 4.5 will explore some special
properties of the total and the sample mean under additional assumptions.

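We first consider the expected value and variance of a linear combination.

THEOREM

Let the rvs X_1, X_2, ..., X_n have mean values μ_1, ..., μ_n and standard deviations
σ_1, ..., σ_n, respectively.

1. Whether or not the X_i's are independent,

\[
\begin{aligned}
E(a_1X_1 + \cdots + a_nX_n + b) &= a_1E(X_1) + \cdots + a_nE(X_n) + b \\
&= a_1\mu_1 + \cdots + a_n\mu_n + b
\end{aligned}
\tag{4.4}
\]

and

\[
\begin{aligned}
\mathrm{Var}(a_1X_1 + \cdots + a_nX_n + b) &= \sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j \mathrm{Cov}(X_i, X_j) \\
&= \sum_{i=1}^{n} a_i^2 \sigma_i^2 + 2\mathop{\sum\sum}_{i<j} a_i a_j \mathrm{Cov}(X_i, X_j)
\end{aligned}
\tag{4.5}
\]

2. If X_1, ..., X_n are independent,

\[
\begin{aligned}
\mathrm{Var}(a_1X_1 + \cdots + a_nX_n + b) &= a_1^2\mathrm{Var}(X_1) + \cdots + a_n^2\mathrm{Var}(X_n) \\
&= a_1^2\sigma_1^2 + \cdots + a_n^2\sigma_n^2
\end{aligned}
\tag{4.6}
\]

and

\[
\mathrm{SD}(a_1X_1 + \cdots + a_nX_n + b) = \sqrt{a_1^2\sigma_1^2 + \cdots + a_n^2\sigma_n^2}
\]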

A paraphrase of Eq. (4.4) is that the expected value of a linear combination is the
same linear combination of the expected values; for example, E(2X_1 + 5X_2) =
2μ_1 + 5μ_2. Equation (4.6) in Statement 2 is a special case of Eq. (4.5) in Statement
1: when the X_i's are independent, Cov(X_i, X_j) = 0 for i ≠ j (this simplification
actually occurs when the X_i's are uncorrelated, a weaker condition than
independence).


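Proofs for the Case n = 2 To establish Eq. (4.4), we could invoke linearity of
expectation from Sect. 4.2, but we present an independent proof here. Suppose that
X_1 and X_2 are continuous with joint pdf f(x_1, x_2). Then

\[
\begin{aligned}
E(a_1X_1 + a_2X_2 + b) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (a_1x_1 + a_2x_2 + b)\, f(x_1,x_2)\, dx_1\, dx_2 \\
&= a_1\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_1 f(x_1,x_2)\, dx_2\, dx_1
 + a_2\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x_2 f(x_1,x_2)\, dx_1\, dx_2 \\
&\qquad + b\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1,x_2)\, dx_1\, dx_2 \\
&= a_1\int_{-\infty}^{\infty} x_1\left[\int_{-\infty}^{\infty} f(x_1,x_2)\, dx_2\right] dx_1
 + a_2\int_{-\infty}^{\infty} x_2\left[\int_{-\infty}^{\infty} f(x_1,x_2)\, dx_1\right] dx_2 + b(1) \\
&= a_1\int_{-\infty}^{\infty} x_1 f_{X_1}(x_1)\, dx_1 + a_2\int_{-\infty}^{\infty} x_2 f_{X_2}(x_2)\, dx_2 + b \\
&= a_1E(X_1) + a_2E(X_2) + b
\end{aligned}
\]

Summation replaces integration in the discrete case. The argument for Eq. (4.5)
does not require specifying whether either variable is discrete or continuous.
Recalling that Var(Y) = E[(Y − μ_Y)²],

\[
\begin{aligned}
\mathrm{Var}(a_1X_1 + a_2X_2 + b) &= E\left[\left(a_1X_1 + a_2X_2 + b - (a_1\mu_1 + a_2\mu_2 + b)\right)^2\right] \\
&= E\left[\left(a_1X_1 - a_1\mu_1 + a_2X_2 - a_2\mu_2\right)^2\right] \\
&= E\left[a_1^2(X_1 - \mu_1)^2 + a_2^2(X_2 - \mu_2)^2 + 2a_1a_2(X_1 - \mu_1)(X_2 - \mu_2)\right] \\
&= a_1^2E\left[(X_1 - \mu_1)^2\right] + a_2^2E\left[(X_2 - \mu_2)^2\right] + 2a_1a_2E\left[(X_1 - \mu_1)(X_2 - \mu_2)\right]
\end{aligned}
\]

where the last equality comes from linearity of expectation. We recognize the terms in
this last expression as variances and covariance, all together a_1²Var(X_1) + a_2²Var(X_2) +
2a_1a_2Cov(X_1, X_2), as required. ■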


Example 4.18 A gas station sells three grades of gasoline: regular, plus, and
premium. These are priced at $3.50, $3.65, and $3.80 per gallon, respectively. Let
X_1, X_2, and X_3 denote the amounts of these grades purchased (gallons) on a particular
day. Suppose the X_i's are independent with μ_1 = 1000, μ_2 = 500, μ_3 = 300, σ_1 = 100,
σ_2 = 80, and σ_3 = 50. The revenue from sales is Y = 3.5X_1 + 3.65X_2 + 3.8X_3, and

\[
\begin{aligned}
E(Y) &= 3.5\mu_1 + 3.65\mu_2 + 3.8\mu_3 = \$6465 \\
\mathrm{Var}(Y) &= 3.5^2\sigma_1^2 + 3.65^2\sigma_2^2 + 3.8^2\sigma_3^2 = 243{,}864 \\
\mathrm{SD}(Y) &= \sqrt{243{,}864} = \$493.83
\end{aligned}
\]
■
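The calculation in Example 4.18 is an instance of Eqs. (4.4) and (4.5) with all off-diagonal covariances equal to zero. The sketch below is illustrative only (the function names and the covariance-matrix representation are assumptions); it implements the general double sum of Eq. (4.5), so the same code would also handle dependent variables if nonzero covariances were supplied.

```python
from math import sqrt


def lin_comb_mean(a, mu, b=0.0):
    """Eq. (4.4): a_1*mu_1 + ... + a_n*mu_n + b."""
    return sum(ai * mi for ai, mi in zip(a, mu)) + b


def lin_comb_var(a, cov):
    """Eq. (4.5): double sum of a_i * a_j * Cov(X_i, X_j); b does not matter."""
    n = len(a)
    return sum(a[i] * a[j] * cov[i][j] for i in range(n) for j in range(n))


# Example 4.18: prices per gallon, mean gallons sold, and standard deviations.
a = [3.5, 3.65, 3.8]
mu = [1000, 500, 300]
sigma = [100, 80, 50]

# Independence: variances on the diagonal, zero covariances elsewhere.
cov = [[sigma[i] ** 2 if i == j else 0 for j in range(3)] for i in range(3)]

mean_y = lin_comb_mean(a, mu)
var_y = lin_comb_var(a, cov)
sd_y = sqrt(var_y)

print(round(mean_y, 2), round(var_y, 2), round(sd_y, 2))
```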
Example 4.19 Recall that a hypergeometric rv X is the number of successes in a
random sample of size n selected without replacement from a population of size
N consisting of M successes and N − M failures. It is tricky to obtain the mean value
and variance of X directly from the pmf, and the hypergeometric moment
generating function is very complicated. We now show how the foregoing proposition
on linear combinations can be used to accomplish this task.

To this end, let X_1 = 1 if the first individual or object selected is a success and
X_1 = 0 if it is a failure; define X_2, X_3, ..., X_n analogously for the second selection,
third selection, and so on. Each X_i is a Bernoulli rv, and each has the same marginal
distribution: p(1) = M/N and p(0) = 1 − M/N (this is obvious for X_1, which is based
on the very first draw from the population, and can be verified for the other draws as
well). Thus E(X_i) = 0(1 − M/N) + 1(M/N) = M/N. The total number of successes in
the sample is X = X_1 + ... + X_n (a 1 is added in for each success and a 0 for each
failure), so

E(X) = E(X_1) + ... + E(X_n) = M/N + M/N + ... + M/N = n(M/N) = np

where p denotes the success probability on any particular draw (trial). That is, just
as in the case of a binomial rv, the expected value of a hypergeometric rv is the
success probability on any trial multiplied by the number of trials. Notice that we
were able to apply Statement 1 of the foregoing theorem, even though the X_i are not
independent.

However, the variance of X here is not the same as the binomial variance because
the successive draws are not independent. Consider the joint distribution of X_1 and
X_2:

\[
p(1,1) = \frac{M}{N}\left(\frac{M-1}{N-1}\right), \qquad
p(0,0) = \left(\frac{N-M}{N}\right)\left(\frac{N-M-1}{N-1}\right), \qquad
p(1,0) = p(0,1) = \frac{M}{N}\left(\frac{N-M}{N-1}\right)
\]

This is also the joint pmf of any pair X_i, X_j. A slightly tedious calculation then
results in

\[
\mathrm{Cov}(X_i, X_j) = -\frac{p(1-p)}{N-1} \qquad (i \ne j)
\]

Applying the variance formula from Statement 1 of the theorem eventually
yields

\[
\mathrm{Var}(X) = \mathrm{Var}(X_1 + \cdots + X_n) = n\mathrm{Var}(X_1) + n(n-1)\mathrm{Cov}(X_1, X_2)
= np(1-p)\left(\frac{N-n}{N-1}\right)
\]

This is quite close to the binomial variance provided that n is much smaller than
N so that the last term in parentheses is close to 1. ■
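The mean and variance formulas derived in Example 4.19 can be checked against a brute-force summation over the hypergeometric pmf. The sketch below is an illustration only: the population values N = 20, M = 12, n = 5 are hypothetical, and exact fractions are used so that the two routes can be compared without rounding.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(x, n, M, N):
    """P(X = x) for a sample of size n from N items, M of which are successes."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))


def moments_from_pmf(n, M, N):
    """Mean and variance by direct summation over the support of the pmf."""
    support = range(max(0, n - (N - M)), min(n, M) + 1)
    mean = sum(x * hypergeom_pmf(x, n, M, N) for x in support)
    second = sum(x * x * hypergeom_pmf(x, n, M, N) for x in support)
    return mean, second - mean ** 2


def moments_from_formula(n, M, N):
    """Mean np and variance np(1 - p)(N - n)/(N - 1), with p = M/N."""
    p = Fraction(M, N)
    return n * p, n * p * (1 - p) * Fraction(N - n, N - 1)


# Hypothetical population: N = 20 items, M = 12 successes, sample size n = 5.
n, M, N = 5, 12, 20
direct = moments_from_pmf(n, M, N)
formula = moments_from_formula(n, M, N)

print(direct)
print(formula)
```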
The following corollary expresses the n = 2 case of the main theorem for ease of
use, including the important special cases of the sum and the difference of two
random variables.


COROLLARY
For any two rvs X_1 and X_2, and any constants a_1, a_2, b,

\[
E(a_1X_1 + a_2X_2 + b) = a_1E(X_1) + a_2E(X_2) + b
\]

and

\[
\mathrm{Var}(a_1X_1 + a_2X_2 + b) = a_1^2\mathrm{Var}(X_1) + a_2^2\mathrm{Var}(X_2) + 2a_1a_2\mathrm{Cov}(X_1, X_2)
\]

In particular, E(X_1 + X_2) = E(X_1) + E(X_2) and, if X_1 and X_2 are independent,
Var(X_1 + X_2) = Var(X_1) + Var(X_2).¹ Also, E(X_1 − X_2) = E(X_1) − E(X_2) and, if
X_1 and X_2 are independent,

Var(X_1 − X_2) = Var(X_1) + Var(X_2).


The expected value of a difference is the difference of the two expected values,
but the variance of a difference between two independent variables is the sum, not
the difference, of the two variances. There is just as much variability in X_1 − X_2 as
in X_1 + X_2: writing X_1 − X_2 = X_1 + (−1)X_2, (−1)X_2 has the same amount of variability
as X_2 itself.


Example 4.20 An automobile manufacturer equips a particular model with either a
six-cylinder engine or a four-cylinder engine. Let X_1 and X_2 be fuel efficiencies for
independently and randomly selected six-cylinder and four-cylinder cars, respectively.
With μ_1 = 22, μ_2 = 26, σ_1 = 1.2, and σ_2 = 1.5,

\[
\begin{aligned}
E(X_1 - X_2) &= \mu_1 - \mu_2 = 22 - 26 = -4 \\
\mathrm{Var}(X_1 - X_2) &= \sigma_1^2 + \sigma_2^2 = 1.2^2 + 1.5^2 = 3.69 \\
\mathrm{SD}(X_1 - X_2) &= \sqrt{3.69} = 1.92
\end{aligned}
\]

If we relabel so that X_1 refers to the four-cylinder car, then E(X_1 − X_2) =
26 − 22 = 4, but the variance of the difference is still 3.69. ■


¹ This property of independent rvs can also be written as SD(X_1)² + SD(X_2)² = SD(X_1 + X_2)². In part
because the formula has the format a² + b² = c², statisticians sometimes call this property the
Pythagorean Theorem.


4.3.1 The PDF of a Sum


Generally speaking, knowing the mean and standard deviation of a random variable
W is not enough to specify its probability distribution and thus be able to compute
probabilities such as P(W > 10) or P(W ≤ −2). In the case of independent rvs, a
general method exists for determining the pdf of the sum X_1 + ... + X_n from their
marginal pdfs. We present first the result for two random variables.


THEOREM
Suppose X and Y are independent, continuous rvs with marginal pdfs f_X(x) and
f_Y(y), respectively. Then the pdf of the rv W = X + Y is given by

\[
f_W(w) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(w - x)\, dx
\]

[In mathematics, this integral operation is known as the convolution of
f_X(x) and f_Y(y) and is sometimes denoted f_W = f_X * f_Y.] The limits of integration
are determined by which x values make both f_X(x) > 0 and f_Y(w − x) > 0.

Proof Since X and Y are independent, their joint pdf is given by f_X(x) · f_Y(y). The
cdf of W is then

F_W(w) = P(W ≤ w) = P(X + Y ≤ w)

To calculate P(X + Y ≤ w), we must integrate over the set of numbers {(x, y):
x + y ≤ w}, which is the shaded region indicated in Fig. 4.6.
The resulting limits of integration are −∞ < x < ∞ and −∞ < y ≤ w − x, and so

\[
\begin{aligned}
F_W(w) = P(X + Y \le w)
&= \int_{-\infty}^{\infty}\int_{-\infty}^{w-x} f_X(x)\, f_Y(y)\, dy\, dx \\
&= \int_{-\infty}^{\infty} f_X(x)\left[\int_{-\infty}^{w-x} f_Y(y)\, dy\right] dx \\
&= \int_{-\infty}^{\infty} f_X(x)\, F_Y(w - x)\, dx
\end{aligned}
\]

The pdf of W is the derivative of this expression with respect to w; taking the
derivative underneath the integral sign yields the desired result. ■

[Figure 4.6: the region x + y ≤ w shaded in the (x, y) plane]

Fig. 4.6 Region of integration for P(X + Y ≤ w)

By a similar argument, the pdf of W = X + Y can be determined even when X and
Y are not independent. Assuming X and Y have joint pdf f(x, y),

\[
f_W(w) = \int_{-\infty}^{\infty} f(x, w - x)\, dx
\]

Example 4.21 In a standby system, a component is used until it wears out and is
then immediately replaced by another, not necessarily identical, component. (The
second component is said to be "in standby mode," i.e., waiting to be used.) The
overall lifetime of a standby system is just the sum of the lifetimes of its individual
components. Let X and Y denote the lifetimes of the two components of a standby
system, and suppose X and Y are independent exponentially distributed random
variables with expected lifetimes 3 weeks and 4 weeks, respectively. Let W = X + Y,
the lifetime of the standby system.

Using the first theorem of this section, the expected lifetime of the standby
system is E(W) = E(X) + E(Y) = 3 + 4 = 7 weeks. Since X and Y are exponential, the
variance of each one is the square of its mean (9 and 16, respectively); since they are
also independent,

Var(W) = Var(X) + Var(Y) = 3² + 4² = 25

It follows that SD(W) = 5 weeks. Since μ_W ≠ σ_W, W cannot itself be exponentially
distributed, but we can use the previous theorem to find its pdf.

The marginal pdfs of X and Y are f_X(x) = (1/3)e^{−x/3} for x > 0 and f_Y(y) = (1/4)e^{−y/4}
for y > 0. Substituting y = w − x, the inequalities x > 0 and w − x > 0 imply
0 < x < w, which are the limits of integration of the convolution integral:

\[
\begin{aligned}
f_W(w) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(w - x)\, dx
&= \int_{0}^{w} (1/3)e^{-x/3}\,(1/4)e^{-(w-x)/4}\, dx \\
&= \frac{1}{12}\, e^{-w/4}\int_{0}^{w} e^{-x/12}\, dx \\
&= e^{-w/4}\left(1 - e^{-w/12}\right), \qquad w > 0
\end{aligned}
\]

The pdf of W appears in Fig. 4.7. As a check, the mean and variance of W can be
verified directly from its pdf.

The probability the standby system lasts more than its expected lifetime of
7 weeks is given by

\[P(W > 7) = \int_7^{\infty} e^{-w/4}\left(1 - e^{-w/12}\right) dw = 4e^{-7/4} - 3e^{-7/3} \approx .4042\] ■

Fig. 4.7 The pdf of W = X + Y for Example 4.21
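
As a numerical sanity check on Example 4.21, the sketch below integrates the derived pdf with a simple midpoint rule. The helper `integrate` and the truncation point `UPPER` are illustrative choices, not part of the text; the tail beyond 200 weeks is negligible for this pdf.

```python
import math

def f_W(w):
    """pdf of W = X + Y from Example 4.21 (exponential means 3 and 4 weeks)."""
    return math.exp(-w / 4) * (1 - math.exp(-w / 12)) if w > 0 else 0.0

def integrate(g, a, b, n=100_000):
    """Composite midpoint rule on [a, b] (illustrative helper)."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

UPPER = 200.0  # assumed truncation point for the infinite upper limit

total = integrate(f_W, 0.0, UPPER)                        # should be close to 1
mean = integrate(lambda w: w * f_W(w), 0.0, UPPER)        # should be close to 7
second = integrate(lambda w: w * w * f_W(w), 0.0, UPPER)
variance = second - mean ** 2                             # should be close to 25

# Closed form of the integral of f_W from 7 to infinity.
p_more_than_7 = 4 * math.exp(-7 / 4) - 3 * math.exp(-7 / 3)

print(total, mean, variance, p_more_than_7)
```

The mean and variance agree with E(W) = 7 and Var(W) = 25 obtained from the rules for linear combinations.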

As a generalization of the previous proposition, the pdf of the sum \(W = X_1 + \cdots + X_n\)
of n independent, continuous rvs can be determined by successive convolution:
\(f_W = f_1 * \cdots * f_n\). In most situations, it isn’t practical to evaluate such a
complicated object. Thankfully, as we’ll see next, such tedious computations can
sometimes be avoided with the use of moment generating functions.


4.3.2 Moment Generating Functions for Linear Combinations 


A corollary in Sect. 4.2 stated that the expected value of a product of functions of 
independent random variables is the product of the individual expected values. We 
now use this to formulate the moment generating function of a linear combination 
of independent random variables. 


PROPOSITION 
Let \(X_1, X_2, \ldots, X_n\) be independent random variables with moment generating
functions \(M_{X_1}(t), M_{X_2}(t), \ldots, M_{X_n}(t)\), respectively. Then the moment
generating function of the linear combination \(Y = a_1X_1 + a_2X_2 + \cdots + a_nX_n + b\) is

\[M_Y(t) = e^{bt} M_{X_1}(a_1t) \cdot M_{X_2}(a_2t) \cdots M_{X_n}(a_nt)\]

In the special case that \(a_1 = a_2 = \cdots = a_n = 1\) and b = 0, so \(Y = X_1 + \cdots + X_n\),

\[M_Y(t) = M_{X_1}(t) \cdot M_{X_2}(t) \cdots M_{X_n}(t)\]


That is, the mgf of a sum of independent rvs is the product of the 
individual mgfs. 


Proof First, we write the moment generating function of Y as the expected value of 
a product. 


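\[\begin{aligned}
M_Y(t) = E\left[e^{tY}\right] &= E\left[e^{t(a_1X_1 + a_2X_2 + \cdots + a_nX_n + b)}\right] \\
&= E\left[e^{ta_1X_1 + ta_2X_2 + \cdots + ta_nX_n + tb}\right] = e^{bt} E\left[e^{a_1tX_1} \cdot e^{a_2tX_2} \cdots e^{a_ntX_n}\right]
\end{aligned}\]

The last expression inside brackets is the product of functions of \(X_1, X_2, \ldots, X_n\).
Since the \(X_i\) are independent, the expected value can be distributed across this
product:

\[\begin{aligned}
e^{bt} E\left[e^{a_1tX_1} \cdot e^{a_2tX_2} \cdots e^{a_ntX_n}\right] &= e^{bt} E\left[e^{a_1tX_1}\right] \cdot E\left[e^{a_2tX_2}\right] \cdots E\left[e^{a_ntX_n}\right] \\
&= e^{bt} M_{X_1}(a_1t) \cdot M_{X_2}(a_2t) \cdots M_{X_n}(a_nt)
\end{aligned}\] ■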
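
The proposition can be checked numerically on a small discrete case. The sketch below uses two made-up pmfs and made-up constants `a1`, `a2`, `b` (all hypothetical, chosen only for illustration) and compares E[e^{tY}] computed directly from the joint pmf with the product formula.

```python
import math
from itertools import product

# Hypothetical pmfs of two independent discrete rvs (illustration only).
pmf_x1 = {0: 0.2, 1: 0.5, 2: 0.3}
pmf_x2 = {1: 0.6, 4: 0.4}
a1, a2, b = 2.0, -1.0, 3.0  # hypothetical coefficients of Y = a1*X1 + a2*X2 + b

def mgf(pmf, t):
    """M_X(t) = E[e^{tX}] for a discrete rv given as {value: probability}."""
    return sum(p * math.exp(t * x) for x, p in pmf.items())

def mgf_direct(t):
    """E[e^{tY}] summed over the joint pmf, which factors under independence."""
    return sum(p1 * p2 * math.exp(t * (a1 * x1 + a2 * x2 + b))
               for (x1, p1), (x2, p2) in product(pmf_x1.items(), pmf_x2.items()))

def mgf_formula(t):
    """e^{bt} * M_X1(a1*t) * M_X2(a2*t), as stated in the proposition."""
    return math.exp(b * t) * mgf(pmf_x1, a1 * t) * mgf(pmf_x2, a2 * t)

for t in (-0.5, 0.0, 0.3):
    print(t, mgf_direct(t), mgf_formula(t))
```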

Now suppose we wish to determine the pdf of some linear combination of
independent rvs. Provided we have their mgfs, the previous proposition makes it
easy to determine the mgf of the linear combination. Then, if we can recognize this
mgf as belonging to some known distributional family (binomial, exponential, etc.),
the uniqueness property of mgfs guarantees our linear combination has that particular
distribution. The next several propositions illustrate this technique.


PROPOSITION 

If \(X_1, X_2, \ldots, X_n\) are independent, normally distributed rvs (with possibly
different means and/or sds), then any linear combination of the \(X_i\)s also has a
normal distribution. In particular, the sum of independent normally
distributed rvs itself has a normal distribution, and the difference \(X_1 - X_2\)
between two independent, normally distributed variables is itself normally
distributed.


Proof Let \(Y = a_1X_1 + a_2X_2 + \cdots + a_nX_n + b\), where \(X_i\) is normally distributed
with mean \(\mu_i\) and standard deviation \(\sigma_i\), and the \(X_i\) are independent. From
Sect. 3.3, \(M_{X_i}(t) = e^{\mu_i t + \sigma_i^2 t^2/2}\). Therefore,

\[\begin{aligned}
M_Y(t) &= e^{bt} M_{X_1}(a_1t) \cdot M_{X_2}(a_2t) \cdots M_{X_n}(a_nt) \\
&= e^{bt}\, e^{\mu_1a_1t + \sigma_1^2a_1^2t^2/2}\, e^{\mu_2a_2t + \sigma_2^2a_2^2t^2/2} \cdots e^{\mu_na_nt + \sigma_n^2a_n^2t^2/2} \\
&= e^{(\mu_1a_1 + \mu_2a_2 + \cdots + \mu_na_n + b)t + (\sigma_1^2a_1^2 + \sigma_2^2a_2^2 + \cdots + \sigma_n^2a_n^2)t^2/2} \\
&= e^{\mu t + \sigma^2 t^2/2}
\end{aligned}\]

where \(\mu = a_1\mu_1 + a_2\mu_2 + \cdots + a_n\mu_n + b\) and \(\sigma^2 = a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + \cdots + a_n^2\sigma_n^2\). We
recognize this function as the mgf of a normal random variable, and it follows by the
uniqueness property of mgfs that Y is normally distributed. Notice that the mean
and variance are in agreement with the first proposition of this section. ■


Example 4.22 (Example 4.18 continued) The total revenue from the sale of the
three grades of gasoline on a particular day was \(Y = 3.5X_1 + 3.65X_2 + 3.8X_3\), and we
calculated \(\mu_Y = 6465\) and (assuming independence) \(\sigma_Y = 493.83\). If the \(X_i\)s are
normally distributed, the probability that revenue exceeds 5000 is

\[P(Y > 5000) = P\left(Z > \frac{5000 - 6465}{493.83}\right) = P(Z > -2.967) = 1 - \Phi(-2.967) = .9985\] ■
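
The normal probability in Example 4.22 can be reproduced with Python's standard library. This is a minimal sketch that takes the mean and standard deviation quoted in the example as given; the variable names are illustrative.

```python
from statistics import NormalDist

mu_Y, sigma_Y = 6465, 493.83           # values quoted in Example 4.22
revenue = NormalDist(mu=mu_Y, sigma=sigma_Y)

# P(Y > 5000) = 1 - P(Y <= 5000)
p_exceeds_5000 = 1 - revenue.cdf(5000)
print(round(p_exceeds_5000, 4))
```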


This same method may be applied to Poisson rvs, as the next proposition 
indicates. 


PROPOSITION 
Suppose \(X_1, \ldots, X_n\) are independent Poisson random variables, where \(X_i\) has
mean \(\mu_i\). Then \(Y = X_1 + \cdots + X_n\) also has a Poisson distribution, with mean
\(\mu_1 + \cdots + \mu_n\).


Proof From Sect. 2.7, the mgf of a Poisson rv with mean \(\mu\) is \(e^{\mu(e^t - 1)}\). Since Y is the
sum of the \(X_i\)s, and the \(X_i\)s are independent,

\[M_Y(t) = M_{X_1}(t) \cdots M_{X_n}(t) = e^{\mu_1(e^t - 1)} \cdots e^{\mu_n(e^t - 1)} = e^{(\mu_1 + \cdots + \mu_n)(e^t - 1)}\]

This is the mgf of a Poisson rv with mean \(\mu_1 + \cdots + \mu_n\). Therefore, by the
uniqueness property of mgfs, Y has a Poisson distribution with mean \(\mu_1 + \cdots + \mu_n\).


Example 4.23 During the open enrollment period at a large university, the number
of freshmen registering for classes through the online registration system in 1 h
follows a Poisson distribution with mean 80 students; denote this rv by \(X_1\). Define
\(X_2\), \(X_3\), and \(X_4\) similarly for sophomores, juniors, and seniors, and suppose the
corresponding means are 125, 118, and 140, respectively. Assume these four counts
are independent. The rv \(Y = X_1 + X_2 + X_3 + X_4\) represents the total number of
undergraduate students registering in 1 h; by the preceding proposition, Y is
also a Poisson rv, but with mean 80 + 125 + 118 + 140 = 463 students and standard
deviation \(\sqrt{463} = 21.5\) students. The probability that more than 500 students enroll
during 1 h, exceeding the registration system’s capacity, is then
\(P(Y > 500) = 1 - P(Y \le 500) = .042\) (software was used to perform the calculation). ■
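
The example says software was used for the final probability. One way to carry out that calculation, sketched here with only the standard library, is to sum the Poisson pmf in log space; the helper `poisson_cdf` is an illustrative function, not a reference to any particular package.

```python
import math

def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu); each pmf term is computed in log space
    to avoid overflow in mu**i and i!."""
    log_mu = math.log(mu)
    return sum(math.exp(i * log_mu - mu - math.lgamma(i + 1)) for i in range(k + 1))

means = [80, 125, 118, 140]          # hourly means for the four class years
mu_total = sum(means)                # additivity of independent Poisson rvs
sd_total = math.sqrt(mu_total)
p_over_capacity = 1 - poisson_cdf(500, mu_total)

print(mu_total, round(sd_total, 1), round(p_over_capacity, 3))
```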


Because of the properties stated in the preceding two propositions, both the 
normal and Poisson models are sometimes called additive distributions, meaning 
that the sum of independent rvs from that family (normal or Poisson) will also 
belong to that family. The next proposition shows that not all of the major 
probability distributions are additive; its proof is left as an exercise (Exercise 65). 


PROPOSITION 

Suppose \(X_1, \ldots, X_n\) are independent exponential random variables with
common parameter \(\lambda\). Then \(Y = X_1 + \cdots + X_n\) has a gamma distribution, with
parameters \(\alpha = n\) and \(\beta = 1/\lambda\) (aka the Erlang distribution).


Notice this proposition requires the \(X_i\) to have the same “rate” parameter \(\lambda\), i.e.,
the \(X_i\) must be independent and identically distributed. As we saw in Example 4.21,
the sum of two independent exponential rvs with different parameters does not
follow an exponential distribution.
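
For n = 2 the proposition can be illustrated numerically: the convolution of two exponential pdfs with a common rate should match the gamma pdf with \(\alpha = 2\) and \(\beta = 1/\lambda\). The sketch below assumes an arbitrary rate `lam = 0.5` and uses a midpoint rule; both are illustrative choices.

```python
import math

lam = 0.5  # hypothetical common rate parameter

def exp_pdf(x):
    """Exponential pdf with rate lam."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def convolve_at(w, n=2000):
    """Midpoint-rule approximation of the convolution integral on [0, w]."""
    h = w / n
    return h * sum(exp_pdf((i + 0.5) * h) * exp_pdf(w - (i + 0.5) * h)
                   for i in range(n))

def gamma_pdf(w, alpha, beta):
    """Gamma pdf with shape alpha and scale beta."""
    return w ** (alpha - 1) * math.exp(-w / beta) / (math.gamma(alpha) * beta ** alpha)

for w in (0.5, 2.0, 7.0):
    print(w, convolve_at(w), gamma_pdf(w, 2, 1 / lam))
```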


4.3.3 Exercises: Section 4.3 (43-65) 


43. A shipping company handles containers in three different sizes: (1) 27 ft³
(3 × 3 × 3), (2) 125 ft³, and (3) 512 ft³. Let \(X_i\) (i = 1, 2, 3) denote the
number of type i containers shipped during a given week. With \(\mu_i = E(X_i)\)
and \(\sigma_i = SD(X_i)\), suppose the mean values and standard deviations are as
follows:

\(\mu_1 = 200\)    \(\mu_2 = 250\)    \(\mu_3 = 100\)
\(\sigma_1 = 10\)    \(\sigma_2 = 12\)    \(\sigma_3 =\)

(a) Assuming that \(X_1, X_2, X_3\) are independent, calculate the expected value
and standard deviation of the total volume shipped. [Hint:
Volume = \(27X_1 + 125X_2 + 512X_3\).]
(b) Would your calculations necessarily be correct if the \(X_i\)s were not
independent? Explain.
(c) Suppose that the \(X_i\)s are independent with each one having a normal
distribution. What is the probability that the total volume shipped is at
most 100,000 ft³?

44. Let \(X_1\), \(X_2\), and \(X_3\) represent the times necessary to perform three successive
repair tasks at a service facility. Suppose they are independent, normal rvs
with expected values \(\mu_1\), \(\mu_2\), and \(\mu_3\) and variances \(\sigma_1^2\), \(\sigma_2^2\), and \(\sigma_3^2\), respectively.
(a) If \(\mu_1 = \mu_2 = \mu_3 = 60\) and \(\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = 15\), calculate \(P(X_1 + X_2 + X_3 \le 200)\).
(b) Using the \(\mu_i\)s and \(\sigma_i\)s given in part (a), what is \(P(150 \le X_1 + X_2 + X_3 \le 200)\)?
(c) Using the \(\mu_i\)s and \(\sigma_i\)s given in part (a), calculate \(P(55 \le \bar{X})\) and
\(P(58 \le \bar{X} \le 62)\). [As noted at the beginning of this section, \(\bar{X}\) denotes
the sample mean, so here \(\bar{X} = (X_1 + X_2 + X_3)/3\).]
(d) Using the \(\mu_i\)s and \(\sigma_i\)s given in part (a), calculate
\(P(-10 \le X_1 - .5X_2 - .5X_3 \le 5)\).
(e) If \(\mu_1 = 40\), \(\mu_2 = 50\), \(\mu_3 = 60\), \(\sigma_1^2 = 10\), \(\sigma_2^2 = 12\), and \(\sigma_3^2 = 14\), calculate
\(P(X_1 + X_2 + X_3 \le 160)\) and \(P(X_1 + X_2 \ge 2X_3)\).

45. Five automobiles of the same type are to be driven on a 300-mile trip. The first
two have six-cylinder engines, and the other three have four-cylinder engines.
Let \(X_1\), \(X_2\), \(X_3\), \(X_4\), and \(X_5\) be the observed fuel efficiencies (mpg) for the five
cars. Suppose these variables are independent and normally distributed with
\(\mu_1 = \mu_2 = 20\), \(\mu_3 = \mu_4 = \mu_5 = 22\), and \(\sigma^2 = 4\) for the smaller engines and 3.5 for
the larger engines. Define an rv Y by

\[Y = \frac{X_1 + X_2}{2} - \frac{X_3 + X_4 + X_5}{3}\]

so that Y is a measure of the difference in efficiency between the six-cylinder
and four-cylinder engines. Compute \(P(0 \le Y)\) and \(P(-1 \le Y \le 1)\). [Hint:
\(Y = a_1X_1 + \cdots + a_5X_5\), with \(a_1 = \frac{1}{2}, \ldots, a_5 = -\frac{1}{3}\).]

46. Exercise 28 introduced random variables X and Y, the number of cars and buses,
respectively, carried by a ferry on a single trip. The joint pmf of X and Y is given
in the table in Exercise 9. It is readily verified that X and Y are independent.
(a) Compute the expected value, variance, and standard deviation of the total
number of vehicles on a single trip.
(b) If each car is charged $3 and each bus $10, compute the expected value,
variance, and standard deviation of the revenue resulting from a single trip.

47. A concert has three pieces of music to be played before intermission. The time
taken to play each piece has a normal distribution. Assume that the three times
are independent of each other. The mean times are 15, 30, and 20 min,
respectively, and the standard deviations are 1, 2, and 1.5 min, respectively.
What is the probability that this part of the concert takes at most 1 h? Are there
reasons to question the independence assumption? Explain.

48. Refer to Exercise 3.
(a) Calculate the covariance between \(X_1\) = the number of customers in the
express checkout and \(X_2\) = the number of customers in the superexpress
checkout.
(b) Calculate \(\mathrm{Var}(X_1 + X_2)\). How does this compare to \(\mathrm{Var}(X_1) + \mathrm{Var}(X_2)\)?

49. Suppose your waiting time for a bus in the morning is uniformly distributed on
[0, 8], whereas waiting time in the evening is uniformly distributed on [0, 10]
independent of morning waiting time.
(a) If you take the bus each morning and evening for a week, what is your
total expected waiting time? [Hint: Define rvs \(X_1, \ldots, X_{10}\) and use a rule of
expected value.]
(b) What is the variance of your total waiting time?
(c) What are the expected value and variance of the difference between
morning and evening waiting times on a given day?
(d) What are the expected value and variance of the difference between total
morning waiting time and total evening waiting time for a particular
week?

50. An insurance office buys paper by the ream (500 sheets) for use in the copier,
fax, and printer. Each ream lasts an average of 4 days, with standard deviation
1 day. The distribution is normal, independent of previous reams.
(a) Find the probability that the next ream outlasts the present one by more
than 2 days.
(b) How many reams must be purchased if they are to last at least 60 days
with probability at least 80%?

51. If two loads are applied to a cantilever beam as shown in the accompanying
drawing, the bending moment at 0 due to the loads is \(a_1X_1 + a_2X_2\).

[Drawing: cantilever beam with loads \(X_1\) and \(X_2\) applied at distances \(a_1\) and \(a_2\) from point 0]

(a) Suppose that \(X_1\) and \(X_2\) are independent rvs with means 2 and 4 kips,
respectively, and standard deviations .5 and 1.0 kip, respectively. If
\(a_1 = 5\) ft and \(a_2 = 10\) ft, what is the expected bending moment and what
is the standard deviation of the bending moment?
(b) If \(X_1\) and \(X_2\) are normally distributed, what is the probability that the
bending moment will exceed 75 kip-ft?
(c) Suppose the positions of the two loads are random variables. Denoting
them by \(A_1\) and \(A_2\), assume that these variables have means of 5 and 10 ft,
respectively, that each has a standard deviation of .5, and that all \(A_i\)s and
\(X_i\)s are independent of each other. What is the expected moment now?
(d) For the situation of part (c), what is the variance of the bending moment?
(e) If the situation is as described in part (a) except that \(\mathrm{Corr}(X_1, X_2) = .5\)
(so that the two loads are not independent), what is the variance of the
bending moment?

52. One piece of PVC pipe is to be inserted inside another piece. The length of the
first piece is normally distributed with mean value 20 in. and standard deviation
.5 in. The length of the second piece is a normal rv with mean and standard
deviation 15 in. and .4 in., respectively. The amount of overlap is normally
distributed with mean value 1 in. and standard deviation .1 in. Assuming that
the lengths and amount of overlap are independent of each other, what is the
probability that the total length after insertion is between 34.5 and 35 in.?

53. Two airplanes are flying in the same direction in adjacent parallel corridors. At
time t = 0, the first airplane is 10 km ahead of the second one. Suppose the speed
of the first plane (km/h) is normally distributed with mean 520 and standard
deviation 10 and the second plane’s speed, independent of the first, is also
normally distributed with mean and standard deviation 500 and 10, respectively.
(a) What is the probability that after 2 h of flying, the second plane has not
caught up to the first plane?
(b) Determine the probability that the planes are separated by at most 10 km
after 2 h.

54. Three different roads feed into a particular freeway entrance. Suppose that
during a fixed time period, the number of cars coming from each road onto the
freeway is a random variable, with expected value and standard deviation as
given in the table.

                      Road 1    Road 2    Road 3
Expected value         800       1000      600
Standard deviation     16        25        18

(a) What is the expected total number of cars entering the freeway at this point
during the period? [Hint: Let \(X_i\) = the number from road i.]
(b) What is the standard deviation of the total number of entering cars? Have
you made any assumptions about the relationship between the numbers of
cars on the different roads?
(c) With \(X_i\) denoting the number of cars entering from road i during the period,
suppose that \(\mathrm{Cov}(X_1, X_2) = 80\), \(\mathrm{Cov}(X_1, X_3) = 90\), and \(\mathrm{Cov}(X_2, X_3) = 100\)
(so that the three streams of traffic are not independent). Compute the
expected total number of entering cars and the standard deviation of the total.

55. Suppose we take a random sample of size n from a continuous distribution having
median 0 so that the probability of any one observation being positive is .5. We
now disregard the signs of the observations, rank them from smallest to largest in
absolute value, and then let W = the sum of the ranks of the observations having
positive signs. For example, if the observations are −.3, +.7, +2.1, and −2.5, then
the ranks of positive observations are 2 and 3, so W = 5. In statistics literature,
W is called Wilcoxon’s signed-rank statistic. W can be represented as follows:

\[W = 1 \cdot Y_1 + 2 \cdot Y_2 + 3 \cdot Y_3 + \cdots + n \cdot Y_n = \sum_{i=1}^{n} i \cdot Y_i\]

where the \(Y_i\)s are independent Bernoulli rvs, each with p = .5 (\(Y_i = 1\)
corresponds to the observation with rank i being positive). Compute the
following:
(a) \(E(Y_i)\) and then E(W) using the equation for W [Hint: The first n positive
integers sum to n(n + 1)/2.]
(b) \(\mathrm{Var}(Y_i)\) and then Var(W) [Hint: The sum of the squares of the first
n positive integers is n(n + 1)(2n + 1)/6.]

56. In Exercise 51, the weight of the beam itself contributes to the bending
moment. Assume that the beam is of uniform thickness and density so that
the resulting load is uniformly distributed on the beam. If the weight of the
beam is random, the resulting load from the weight is also random; denote this
load by W (kip-ft).
(a) If the beam is 12 ft long, W has mean 1.5 and standard deviation .25, and the
fixed loads are as described in part (a) of Exercise 51, what are the expected
value and variance of the bending moment? [Hint: If the load due to the
beam were w kip-ft, the contribution to the bending moment would be
\(w\int_0^{12} x\,dx\).]
(b) If all three variables (\(X_1\), \(X_2\), and W) are normally distributed, what is the
probability that the bending moment will be at most 200 kip-ft?

57. A professor has three errands to take care of in the Administration Building. Let
\(X_i\) = the time that it takes for the ith errand (i = 1, 2, 3), and let \(X_4\) = the total
time in minutes that she spends walking to and from the building and between
each errand. Suppose the \(X_i\)s are independent, normally distributed, with the
following means and standard deviations: \(\mu_1 = 15\), \(\sigma_1 = 4\), \(\mu_2 = 5\), \(\sigma_2 = 1\),
\(\mu_3 = 8\), \(\sigma_3 = 2\), \(\mu_4 = 12\), \(\sigma_4 = 3\). She plans to leave her office at precisely
10:00 a.m. and wishes to post a note on her door that reads, “I will return by
t a.m.” What time t should she write down if she wants the probability of her
arriving after t to be .01?

58. In an area having sandy soil, 50 small trees of a certain type were planted, and
another 50 trees were planted in an area having clay soil. Let X = the number of
trees planted in sandy soil that survive 1 year and Y = the number of trees
planted in clay soil that survive 1 year. If the probability that a tree planted in
sandy soil will survive 1 year is .7 and the probability of 1-year survival in clay
soil is .6, compute an approximation to \(P(-5 \le X - Y \le 5)\). [Hint: Use a normal
approximation from Sect. 3.3. Do not bother with the continuity correction.]

59. Let X and Y be independent rvs, with X ~ N(0, 1) and Y ~ N(0, 1).
(a) Use convolution to show that X + Y is also normal, and identify its mean
and standard deviation.
(b) Use the additive property of the normal distribution presented in this
section to verify your answer to part (a).

60. Karen throws two darts at a board with radius 10 in.; let X and Y denote the
distances of the two darts from the center of the board. Under the system Karen
uses, the score she receives depends upon W = X + Y, the sum of these two
distances. Assume X and Y are independent.
(a) Suppose X and Y are both uniform on the interval [0, 10]. Use convolution
to determine the pdf of W = X + Y. Be very careful with your limits of
integration!
(b) Based on the pdf in part (a), calculate \(P(X + Y \le 5)\).
(c) If Karen’s darts are equally likely to land anywhere on the board, it can
be shown that the pdfs of X and Y are \(f_X(x) = x/50\) for \(0 \le x \le 10\) and
\(f_Y(y) = y/50\) for \(0 \le y \le 10\). Use convolution to determine the pdf of
W = X + Y. Again, be very careful with your limits of integration.
(d) Based on the pdf in part (c), calculate \(P(X + Y \le 5)\).

61. Siblings Matt and Liz both enjoy playing roulette. One day, Matt brought $10
to the local casino and Liz brought $15. They sat at different tables, and each
made $1 wagers on red on consecutive spins (10 spins for Matt, 15 for Liz). Let
X = the number of times Matt won and Y = the number of times Liz won.
(a) What is a reasonable probability model for X? [Hint: Successive spins of a
roulette wheel are independent, and P(land on red) = 18/38.]
(b) What is a reasonable probability model for Y?
(c) What is a reasonable probability model for X + Y, the total number of times
Matt and Liz win that day? Explain. [Hint: Since the siblings sat at different
tables, their gambling results are independent.]
(d) Use moment-generating functions, along with your answers to (a) and (b),
to show that your answer to part (c) is correct.
(e) Generalize part (d): If \(X_1, \ldots, X_k\) are independent binomial rvs, with
\(X_i \sim \mathrm{Bin}(n_i, p)\), show that their sum is also binomially distributed.
(f) Does the result of part (e) hold if the probability parameter p is different for
each \(X_i\) (e.g., if Matt bets on red but Liz bets on the number 27)?

62. The children attending Milena’s birthday party are enjoying taking swings at a
piñata. Let X = the number of swings it takes Milena to hit the piñata once
(since she’s the birthday girl, she goes first), and let Y = the number of swings it
takes her brother Lucas to hit the piñata once (he goes second). Assume the
results of successive swings are independent (the children don’t improve, since
they’re blindfolded), and that each child has a .2 probability of hitting the
piñata on any attempt.
(a) What is a reasonable probability model for X?
(b) What is a reasonable probability model for Y?
(c) What is a reasonable probability model for X + Y, the total number of
swings taken by Milena and Lucas? Explain. (Assume Milena’s and
Lucas’ results are independent.)
(d) Use moment-generating functions, along with your answers to (a) and (b),
to show that X + Y has a negative binomial distribution.
(e) Generalize part (d): If \(X_1, \ldots, X_r\) are independent geometric rvs with
common parameter p, show that their sum has a negative binomial
distribution.
(f) Does the result of part (e) hold if the probability parameter p is different for
each \(X_i\) (e.g., if Milena has probability .4 on each attempt while Lucas’
success probability is only .1)?

63. Let \(X_1, \ldots, X_n\) be independent rvs, with \(X_i\) having a negative binomial
distribution with parameters \(r_i\) and p (i = 1, ..., n). Use moment generating functions to
show that \(X_1 + \cdots + X_n\) has a negative binomial distribution, and identify the
parameters of this distribution. Explain why this answer makes sense, based on
the negative binomial model. [Note: Each \(X_i\) may have a different parameter \(r_i\),
but all have the same p parameter.]

64. Let X and Y be independent gamma random variables, both with the same scale
parameter \(\beta\). The value of the shape parameter is \(\alpha_1\) for X and \(\alpha_2\) for Y. Use
moment generating functions to show that X + Y is also gamma distributed, with
shape parameter \(\alpha_1 + \alpha_2\) and scale parameter \(\beta\). Is X + Y gamma distributed if
the scale parameters are different? Explain.

65. Let X and Y be independent exponential random variables with common
parameter \(\lambda\).
(a) Use convolution to show that X + Y has a gamma distribution, and identify
the parameters of that gamma distribution.
(b) Use the previous exercise to establish the same result.
(c) Generalize part (b): If \(X_1, \ldots, X_n\) are independent exponential rvs with
common parameter \(\lambda\), what is the distribution of their sum?


4.4 Conditional Distributions and Conditional Expectation 


The distribution of Y can depend strongly on the value of another variable X. For 
example, if X is height and Y is weight, the distribution of weight for men who are 
6 ft tall is very different from the distribution of weight for short men. The 
conditional distribution of Y given X =x describes for each possible x value how 
probability is distributed over the set of y values. We define below the conditional 
distribution of Y given X, but the conditional distribution of X given Y can be 
obtained by just reversing the roles of X and Y. Both definitions are analogous to 
that of the conditional probability, P(A|B), as the ratio P(A ∩ B)/P(B).


DEFINITION 

Let X and Y be two discrete random variables with joint pmf p(x, y) and
marginal X pmf \(p_X(x)\). Then for any x value such that \(p_X(x) > 0\), the conditional
probability mass function of Y given X = x is

\[p_{Y|X}(y|x) = \frac{p(x, y)}{p_X(x)}\]

An analogous formula holds in the continuous case. Let X and Y be two
continuous random variables with joint pdf f(x, y) and marginal X pdf \(f_X(x)\).
Then for any x value such that \(f_X(x) > 0\), the conditional probability density
function of Y given X = x is

\[f_{Y|X}(y|x) = \frac{f(x, y)}{f_X(x)}\]

Example 4.24 For a discrete example, reconsider Example 4.1, where X represents
the deductible amount on an automobile policy and Y represents the deductible
amount on a homeowner’s policy. Here is the joint distribution again.

                        y
p(x, y)        0       100      200
x    100      .20      .10      .20      .50
     250      .05      .15      .30      .50
              .25      .25      .50

The distribution of Y depends on X. In particular, let’s find the conditional
probability that Y is 200, given that X is 250, first using the definition of conditional
probability from Sect. 1.4:

\[P(Y = 200 \mid X = 250) = \frac{P(Y = 200 \text{ and } X = 250)}{P(X = 250)} = \frac{.30}{.50} = .6\]

With our new definition we obtain the same result:

\[p_{Y|X}(200|250) = \frac{p(250, 200)}{p_X(250)} = \frac{.30}{.50} = .6\]

The conditional probabilities for the two other possible values of Y are

\[p_{Y|X}(0|250) = \frac{p(250, 0)}{p_X(250)} = \frac{.05}{.50} = .1\]
\[p_{Y|X}(100|250) = \frac{p(250, 100)}{p_X(250)} = \frac{.15}{.50} = .3\]

Notice that \(p_{Y|X}(0|250) + p_{Y|X}(100|250) + p_{Y|X}(200|250) = .1 + .3 + .6 = 1\). This is
no coincidence: conditional probabilities satisfy the properties of ordinary
probabilities (i.e., they are nonnegative and they sum to 1). Essentially, the
denominator in the definition of conditional probability is designed to make the total be 1.

Reversing the roles of X and Y, we find the conditional distribution for X, given
that Y = 0:

\[p_{X|Y}(100|0) = \frac{p(100, 0)}{p_Y(0)} = \frac{.20}{.20 + .05} = .8\]
\[p_{X|Y}(250|0) = \frac{p(250, 0)}{p_Y(0)} = \frac{.05}{.20 + .05} = .2\]

Again, the conditional probabilities add to 1. ■
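
The calculations in Example 4.24 amount to dividing one row (or column) of the joint table by its total. A minimal sketch follows; the dictionary layout and function names are illustrative choices.

```python
# Joint pmf p(x, y) from Example 4.24, keyed by (x, y).
joint = {
    (100, 0): .20, (100, 100): .10, (100, 200): .20,
    (250, 0): .05, (250, 100): .15, (250, 200): .30,
}

def cond_pmf_y_given_x(x):
    """p_{Y|X}(y|x) = p(x, y) / p_X(x) for every y with that x."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)   # marginal p_X(x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

def cond_pmf_x_given_y(y):
    """p_{X|Y}(x|y) = p(x, y) / p_Y(y) for every x with that y."""
    p_y = sum(p for (_, yi), p in joint.items() if yi == y)   # marginal p_Y(y)
    return {x: p / p_y for (x, yi), p in joint.items() if yi == y}

print(cond_pmf_y_given_x(250))   # approximately {0: .1, 100: .3, 200: .6}
print(cond_pmf_x_given_y(0))     # approximately {100: .8, 250: .2}
```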

Example 4.25 For a continuous example, recall Example 4.5, where X is the
weight of almonds and Y is the weight of cashews in a can of mixed nuts. The
sum X + Y is at most 1 lb, the total weight of the can of nuts. The joint pdf of X and
Y is

\[f(x, y) = \begin{cases} 24xy & 0 \le x \le 1,\ 0 \le y \le 1,\ x + y \le 1 \\ 0 & \text{otherwise} \end{cases}\]

and in Example 4.5 it was shown that

\[f_X(x) = \begin{cases} 12x(1 - x)^2 & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}\]

Thus, the conditional pdf of Y given that X = x is

\[f_{Y|X}(y|x) = \frac{f(x, y)}{f_X(x)} = \frac{24xy}{12x(1 - x)^2} = \frac{2y}{(1 - x)^2}, \quad 0 \le y \le 1 - x\]

This can be used to get conditional probabilities for Y. For example,

\[P(Y \le .25 \mid X = .5) = \int_{-\infty}^{.25} f_{Y|X}(y|.5)\,dy = \int_0^{.25} \frac{2y}{(.5)^2}\,dy = \left[4y^2\right]_0^{.25} = .25\]

Given that the weight of almonds (X) is .5 lb, the probability is .25 for the weight of
cashews (Y) to be less than .25 lb.

Just as in the discrete case, the conditional distribution assigns a total probability
of 1 to the set of all possible Y values. That is, integrating the conditional density
over its set of possible values should yield 1:

\[\int_{-\infty}^{\infty} f_{Y|X}(y|x)\,dy = \int_0^{1 - x} \frac{2y}{(1 - x)^2}\,dy = \left[\frac{y^2}{(1 - x)^2}\right]_0^{1 - x} = 1\]

Whenever you calculate a conditional density, we recommend doing this
integration as a validity check. ■
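
The recommended validity check can also be done numerically. The sketch below uses a midpoint rule (an illustrative choice, exact here up to rounding because the conditional density is linear in y) to confirm that the conditional pdf of Example 4.25 integrates to 1 and that P(Y ≤ .25 | X = .5) = .25.

```python
def cond_pdf(y, x):
    """f_{Y|X}(y|x) = 2y / (1 - x)^2 on 0 <= y <= 1 - x, else 0 (Example 4.25)."""
    return 2 * y / (1 - x) ** 2 if 0 <= y <= 1 - x else 0.0

def integrate(g, a, b, n=10_000):
    """Composite midpoint rule on [a, b] (illustrative helper)."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

x = 0.5
total = integrate(lambda y: cond_pdf(y, x), 0.0, 1 - x)     # validity check
p_le_quarter = integrate(lambda y: cond_pdf(y, x), 0.0, 0.25)

print(total, p_le_quarter)
```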


4.4.1 Conditional Distributions and Independence 


Recall that in Sect. 4.1 two random variables were defined to be independent if their 
joint pmf or pdf factors into the product of the marginal pmfs or pdfs. We can 
understand this definition better with the help of conditional distributions. For 
example, suppose there is independence in the discrete case. Then 

\[p_{Y|X}(y|x) = \frac{p(x, y)}{p_X(x)} = \frac{p_X(x)\,p_Y(y)}{p_X(x)} = p_Y(y)\]

That is, independence implies that the conditional distribution of Y is the same as
the unconditional (i.e., marginal) distribution, and that this is true no matter the
value of X. The implication works in the other direction, too. If \(p_{Y|X}(y|x) = p_Y(y)\),
then

\[\frac{p(x, y)}{p_X(x)} = p_Y(y)\]

so \(p(x, y) = p_X(x)\,p_Y(y)\), and therefore X and Y are independent.

In Example 4.7 we said that independence necessitates the region of positive 
density being a rectangle (possibly infinite in extent). In terms of conditional 
distributions, this region tells us the domain of Y for each possible x value. For 
independence we need to have the domain of Y not be dependent on X, so the 
interval of positive density must be the same for each x, implying a rectangular 
region. 


4.4.2 Conditional Expectation and Variance


Because the conditional distribution is a valid probability distribution, it makes 
sense to define the conditional mean and variance. 


DEFINITION 

Let X and Y be two discrete random variables with conditional probability
mass function \(p_{Y|X}(y|x)\). Then the conditional expectation (or conditional
mean) of Y given X = x is

\[\mu_{Y|X=x} = E(Y \mid X = x) = \sum_{y} y \cdot p_{Y|X}(y|x)\]

Analogously, for two continuous rvs X and Y with conditional probability
density function \(f_{Y|X}(y|x)\),

\[\mu_{Y|X=x} = E(Y \mid X = x) = \int_{-\infty}^{\infty} y \cdot f_{Y|X}(y|x)\,dy\]

More generally, the conditional mean of any function h(Y) is given by

\[E(h(Y) \mid X = x) = \begin{cases} \sum_{y} h(y) \cdot p_{Y|X}(y|x) & \text{(discrete case)} \\ \int_{-\infty}^{\infty} h(y) \cdot f_{Y|X}(y|x)\,dy & \text{(continuous case)} \end{cases}\]

In particular, the conditional variance of Y given X = x is

\[\sigma^2_{Y|X=x} = \mathrm{Var}(Y \mid X = x) = E\left[(Y - \mu_{Y|X=x})^2 \mid X = x\right] = E(Y^2 \mid X = x) - \mu^2_{Y|X=x}\]


Example 4.26 Having previously found the conditional distribution of Y given
X = 250 in Example 4.24, we now compute the conditional mean and variance.

\[\begin{aligned}
\mu_{Y|X=250} = E(Y \mid X = 250) &= 0 \cdot p_{Y|X}(0|250) + 100 \cdot p_{Y|X}(100|250) + 200 \cdot p_{Y|X}(200|250) \\
&= 0(.1) + 100(.3) + 200(.6) = 150
\end{aligned}\]

The average homeowner’s policy deductible, among customers with a $250 auto
deductible, is $150. Given that the possibilities for Y are 0, 100, and 200 and most of
the probability is on the latter two values, it is reasonable that the conditional mean
should be between 100 and 200.

Using the alternative (shortcut) formula for the conditional variance requires
first obtaining the conditional expectation of \(Y^2\):

\[\begin{aligned}
E(Y^2 \mid X = 250) &= 0^2 \cdot p_{Y|X}(0|250) + 100^2 \cdot p_{Y|X}(100|250) + 200^2 \cdot p_{Y|X}(200|250) \\
&= 0^2(.1) + 100^2(.3) + 200^2(.6) = 27{,}000
\end{aligned}\]

Thus,

\[\sigma^2_{Y|X=250} = \mathrm{Var}(Y \mid X = 250) = E(Y^2 \mid X = 250) - \mu^2_{Y|X=250} = 27{,}000 - 150^2 = 4500\]

Taking the square root gives a conditional standard deviation of $67.08, which is in the
right ballpark when we recall that the possible values of Y are 0, 100, and 200. ■
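
The same conditional mean and variance can be computed directly from the conditional pmf of Y given X = 250. A minimal sketch using the shortcut formula follows; the variable names are illustrative.

```python
import math

# Conditional pmf of Y given X = 250, from Example 4.24.
cond_pmf = {0: .1, 100: .3, 200: .6}

cond_mean = sum(y * p for y, p in cond_pmf.items())              # E(Y | X = 250)
cond_second_moment = sum(y ** 2 * p for y, p in cond_pmf.items())  # E(Y^2 | X = 250)
cond_var = cond_second_moment - cond_mean ** 2                   # shortcut formula
cond_sd = math.sqrt(cond_var)

print(cond_mean, cond_second_moment, cond_var, round(cond_sd, 2))
```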


Example 4.27 (Example 4.25 continued) Suppose a 1-lb can of mixed nuts
contains .1 lb of almonds (i.e., we know that X = .1). Given this information, the
amount of cashews Y in the can is constrained by \(0 \le y \le 1 - x = .9\), and the
expected amount of cashews in such a can is

\[E(Y \mid X = .1) = \int_0^{.9} y \cdot f_{Y|X}(y|.1)\,dy = \int_0^{.9} y \cdot \frac{2y}{(.9)^2}\,dy = .6\]

The conditional variance of Y given that X = .1 is

\[\mathrm{Var}(Y \mid X = .1) = \int_0^{.9} (y - .6)^2 \cdot f_{Y|X}(y|.1)\,dy = \int_0^{.9} (y - .6)^2 \cdot \frac{2y}{(.9)^2}\,dy = .045\]

Using the aforementioned shortcut, this can also be calculated in two steps:

\[E(Y^2 \mid X = .1) = \int_0^{.9} y^2 \cdot f_{Y|X}(y|.1)\,dy = \int_0^{.9} y^2 \cdot \frac{2y}{(.9)^2}\,dy = .405\]

\[\Rightarrow \mathrm{Var}(Y \mid X = .1) = .405 - (.6)^2 = .045\]

More generally, conditional on X = x lb (where 0 < x < 1), integrals similar to
those above can be used to show that the conditional mean amount of cashews is
2(1 − x)/3, and the corresponding conditional variance is (1 − x)²/18. This formula
implies that the variance gets smaller as the weight of almonds in a can approaches
1 lb. Does this make sense? When the weight of almonds is 1 lb, the weight of
cashews is guaranteed to be 0, implying that the variance is 0. Indeed, Fig. 4.2
shows that the set of possible y-values narrows to 0 as x approaches 1. ■
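
The general formulas for Example 4.27 can be spot-checked by numerical integration: with the conditional pdf 2y/(1 − x)² on [0, 1 − x], the mean should be 2(1 − x)/3 and the variance (1 − x)²/18, which gives .6 and .045 at x = .1. The midpoint-rule helper below is an illustrative choice.

```python
def integrate(g, a, b, n=2000):
    """Composite midpoint rule on [a, b] (illustrative helper)."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

def cond_moments(x):
    """Mean and variance of Y given X = x for f(y|x) = 2y/(1-x)^2 on [0, 1-x]."""
    pdf = lambda y: 2 * y / (1 - x) ** 2
    mean = integrate(lambda y: y * pdf(y), 0.0, 1 - x)
    second = integrate(lambda y: y * y * pdf(y), 0.0, 1 - x)
    return mean, second - mean ** 2   # shortcut formula for the variance

for x in (0.1, 0.5, 0.9):
    mean, var = cond_moments(x)
    print(x, mean, 2 * (1 - x) / 3, var, (1 - x) ** 2 / 18)
```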


4.4.3 The Laws of Total Expectation and Variance 


By the definition of conditional expectation, the rv Y has a conditional mean for 
every possible value x of the variable X. In Example 4.26, we determined the mean 
of Y given that X = 250, but a different mean would result if we conditioned on 
X= 100. For the continuous rvs in Example 4.27, every value x between 0 and 
1 yielded a different conditional mean of Y (and, in fact, we even found a general 
formula for this conditional expectation). As it turns out, these conditional means 
can be related back to the unconditional mean of Y, i.e., \(\mu_Y\). Our next example
illustrates the connection. 


Example 4.28 Apartments in a certain city have x = 0, 1, 2, or 3 bedrooms (0 for a
studio apartment), and y = 1, 1.5, or 2 bathrooms. The accompanying table gives
the proportions of apartments for the various number of bedroom/number of
bathroom combinations.

                    y
p(x, y)       1       1.5      2
x     0      .10      .00      .00
      1      .20      .08      .02
      2      .15      .10      .15
      3      .05      .05      .10

Let X and Y denote the number of bedrooms and bathrooms, respectively, in a
randomly selected apartment in this city. The marginal distribution of Y comes from
the column totals in the joint probability table, from which it is easily verified that
E(Y) = 1.385 and Var(Y) = .179275. The conditional distributions (pmfs) of
Y given that X = x for x = 0, 1, 2, and 3 are as follows:


x = 0: p_{Y|X=0}(1) = 1 (all studio apartments have one bathroom)
x = 1: p_{Y|X=1}(1) = .667, p_{Y|X=1}(1.5) = .267, p_{Y|X=1}(2) = .067
x = 2: p_{Y|X=2}(1) = .375, p_{Y|X=2}(1.5) = .25, p_{Y|X=2}(2) = .375
x = 3: p_{Y|X=3}(1) = .25, p_{Y|X=3}(1.5) = .25, p_{Y|X=3}(2) = .50


From these conditional pmfs, we obtain the expected value of Y given X = x for 
each of the four possible x values: 


E(Y|X = 0) =1, E(Y|X = 1) = 1.2, E(Y|X = 2) = 1.5, E(Y|X = 3) = 1.625 


So, on the average, studio apartments have 1 bathroom, one-bedroom apartments
have 1.2 bathrooms, 2-bedrooms have 1.5 baths, and luxurious 3-bedroom
apartments have 1.625 baths.

Now, instead of writing E(Y | X = x) for some specific value x, let’s consider the
expected number of bathrooms for an apartment of randomly selected size, X. This
expectation, denoted E(Y | X), is itself a random variable, since it is a function of the
random quantity X. Its smallest possible value is 1, which occurs when X = 0, and
that happens with probability .1 (the sum of probabilities in the first row of the joint
probability table). Similarly, the random variable E(Y | X) takes on the value 1.2 with
probability p_X(1) = .3. Continuing in this manner, the probability distribution of the
rv E(Y | X) is as follows:


  Value of E(Y | X)    |  1     1.2    1.5    1.625
  Probability of value |  .1    .3     .4     .2          (4.7)


The expected value of this random variable, denoted E[E(Y | X)], is computed by
taking the weighted average of the four values of E(Y | X = x) against the
probabilities specified by p_X(x), as suggested by (4.7):


E[E(Y | X)] = 1(.1) + 1.2(.3) + 1.5(.4) + 1.625(.2) = 1.385


But this is exactly E(Y), the expected number of bathrooms.
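The calculation in this example can be mirrored in a few lines of code. This is an illustrative sketch, not part of the original example: the joint probabilities below are rebuilt as p_X(x) · p_{Y|X}(y | x) from the values quoted above (an assumption on our part), and the helper names are hypothetical.

```python
# Joint pmf p(x, y): keys are (bedrooms, bathrooms). These values are rebuilt
# from the marginal of X and the conditional pmfs quoted in the example.
joint = {
    (0, 1.0): 0.10, (0, 1.5): 0.00, (0, 2.0): 0.00,
    (1, 1.0): 0.20, (1, 1.5): 0.08, (1, 2.0): 0.02,
    (2, 1.0): 0.15, (2, 1.5): 0.10, (2, 2.0): 0.15,
    (3, 1.0): 0.05, (3, 1.5): 0.05, (3, 2.0): 0.10,
}


def marginal_x(joint):
    """Marginal pmf of X: sum the joint pmf over y."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return px


def conditional_mean(joint, x):
    """E(Y | X = x): weight each y by p(x, y) / p_X(x)."""
    px = marginal_x(joint)[x]
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / px


px = marginal_x(joint)
cond_means = {x: conditional_mean(joint, x) for x in px}

iterated = sum(cond_means[x] * px[x] for x in px)    # E[E(Y | X)]
direct = sum(y * p for (_, y), p in joint.items())   # E(Y) from the joint pmf

print(cond_means)
print(round(iterated, 4), round(direct, 4))
```

Both routes should give the same value, 1.385, which is the point of the example.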




LAW OF TOTAL EXPECTATION 
For any two random variables X and Y, 


E[E(Y | X)] = E(Y)


(This is sometimes referred to as computing E(Y) by means of iterated 
expectation.) 


The Law of Total Expectation says that E(Y) is a weighted average of the
conditional means E(Y | X = x), where the weights are given by the pmf or pdf of
X. It is analogous to the Law of Total Probability, which describes how to find P(B)
as a weighted average of conditional probabilities P(B | A_i).


Proof Here is the proof when both rvs are discrete; in the jointly continuous case, 
simply replace summation by integration and pmfs by pdfs. 


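$$\begin{aligned}
E[E(Y \mid X)] &= \sum_{x \in D_X} E(Y \mid X = x)\,p_X(x) = \sum_{x \in D_X} \Bigl[\sum_{y \in D_Y} y\,p_{Y|X}(y \mid x)\Bigr] p_X(x) \\
&= \sum_{x \in D_X} \sum_{y \in D_Y} y\,\frac{p(x, y)}{p_X(x)}\,p_X(x) = \sum_{y \in D_Y} y \sum_{x \in D_X} p(x, y) \\
&= \sum_{y \in D_Y} y\,p_Y(y) = E(Y)
\end{aligned}$$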


In Example 4.28, the use of iterated expectation to compute E(Y) is unnecessarily
cumbersome; working from the marginal pmf of Y is more straightforward.
However, there are many situations in which the distribution of a variable Y is only 
expressed conditional on the value of another variable X. For these so-called 
hierarchical models, the Law of Total Expectation proves very useful. 


Example 4.29 A ferry goes from the left bank of a small river to the right bank
once an hour. The ferry can accommodate at most two vehicles. The probability that
no vehicles show up is .1, that exactly one shows up is .7, and that two or more
show up is .2 (but only two can be transported). The fare paid for a vehicle depends
upon its weight, and the average fare per vehicle is $25. What is the expected fare
for a single trip made by this ferry?

Let X represent the number of vehicles transported on a trip (so "two or more
show up" corresponds to X = 2), and let Y denote the total fare for a single trip.
The conditional mean of Y, given X, is given by E(Y | X) = 25X.
So, by the Law of Total Expectation,

E(Y) = E[E(Y | X)] = E[25X] = Σ_x 25x · p_X(x)
     = (0)(.1) + (25)(.7) + (50)(.2) = 27.50


The next theorem provides a way to compute the variance of Y by conditioning
on the value of X. There are two contributions to Var(Y). The first part is the
variance of the random variable E(Y | X). The second part involves the random
variable Var(Y | X), the variance of Y as a function of X, and in particular the
expected value of this random variable.


LAW OF TOTAL VARIANCE 
For any two random variables X and Y, 


Var(Y) = Var[E(Y | X)] + E[Var(Y|X)] 


Proving the Law of Total Variance requires some more sophisticated algebra;
see Exercise 84. For those familiar with statistical methods, the Law of Total
Variance is analogous to the famous ANOVA identity, wherein the total variability
in a response variable Y can be decomposed into the differences between group
means (here, the term Var[E(Y | X)]) and the variation of responses within groups
(represented by E[Var(Y | X)] above).


Example 4.30 Let’s verify the Law of Total Variance for the apartment scenario of
Example 4.28. The pmf of the rv E(Y | X) appears in (4.7), from which its variance is
given by


Var[E(Y | X)] = (1 − 1.385)²(.1) + (1.2 − 1.385)²(.3)
              + (1.5 − 1.385)²(.4) + (1.625 − 1.385)²(.2)
              = .0419


(Recall that 1.385 is the mean of the rv E(Y | X), which, by the Law of Total
Expectation, is also E(Y).) The second term in the Law of Total Variance involves
the variable Var(Y | X), which requires determining the conditional variance of
Y given X = x for x = 0, 1, 2, 3. Using the four conditional distributions displayed
in Example 4.28, these are

Var(Y | X = 0) = 0;      Var(Y | X = 1) = .0933
Var(Y | X = 2) = .1875;  Var(Y | X = 3) = .171875


The rv Var(Y | X) takes on these four values with probabilities .1, .3, .4, and .2,
respectively (again, these are inherited from the distribution of X). Thus,


E[Var(Y | X)] = 0(.1) + .0933(.3) + .1875(.4) + .171875(.2) = .137375


Combining, Var[E(Y | X)] + E[Var(Y | X)] = .0419 + .137375 = .179275.
This is exactly Var(Y) computed using the marginal pmf of Y in Example 4.28,
and the Law of Total Variance is verified for this example.
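The same verification can be scripted. The sketch below is illustrative only: the joint probabilities are rebuilt as p_X(x) · p_{Y|X}(y | x) from the values quoted in Example 4.28 (an assumption on our part), and the function name moments_given_x is hypothetical.

```python
# Joint pmf p(x, y) for the apartment scenario (bedrooms x, bathrooms y),
# rebuilt from the marginal of X and the conditional pmfs quoted in the text.
joint = {
    (0, 1.0): 0.10, (0, 1.5): 0.00, (0, 2.0): 0.00,
    (1, 1.0): 0.20, (1, 1.5): 0.08, (1, 2.0): 0.02,
    (2, 1.0): 0.15, (2, 1.5): 0.10, (2, 2.0): 0.15,
    (3, 1.0): 0.05, (3, 1.5): 0.05, (3, 2.0): 0.10,
}


def moments_given_x(joint, x):
    """Return (P(X = x), E(Y | X = x), Var(Y | X = x))."""
    px = sum(p for (xx, _), p in joint.items() if xx == x)
    mean = sum(y * p for (xx, y), p in joint.items() if xx == x) / px
    second = sum(y * y * p for (xx, y), p in joint.items() if xx == x) / px
    return px, mean, second - mean ** 2


rows = [moments_given_x(joint, x) for x in (0, 1, 2, 3)]
mu_y = sum(px * m for px, m, _ in rows)                    # E[E(Y | X)] = E(Y)
between = sum(px * (m - mu_y) ** 2 for px, m, _ in rows)   # Var[E(Y | X)]
within = sum(px * v for px, _, v in rows)                  # E[Var(Y | X)]

# Var(Y) computed directly from the joint pmf, for comparison.
ey2 = sum(y * y * p for (_, y), p in joint.items())
total = ey2 - mu_y ** 2

print(round(between, 4), round(within, 6), round(total, 6))
```

The "between" and "within" pieces should add up to the directly computed variance, mirroring the ANOVA-style decomposition described earlier.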


The computation of Var(Y) in Example 4.30 is clearly not efficient; it is much 
easier, given the joint pmf of X and Y, to determine the variance of Y from its 
marginal pmf. As with the Law of Total Expectation, the real worth of the Law of 
Total Variance comes from its application to hierarchical models, where the 
distribution of one variable (Y, say) is only known conditional on the distribution 
of another rv. 


Example 4.31 In the manufacture of ceramic tiles used for heat shielding, the 
proportion of tiles that meet the required thermal specifications varies from day to 
day. Let P denote the proportion of tiles meeting specifications on a randomly 
selected day, and suppose P can be modeled by the following pdf: 


f(p) = 9p^8,   0 < p < 1


At the end of each day, a random sample of n = 20 tiles is selected and each tile 
is tested. Let Y denote the number of tiles among the 20 that meet specifications; 
conditional on P=p, Y~Bin(20, p). Find the expected number of tiles meeting 
thermal specifications in a daily sample of 20, and find the corresponding standard 
deviation. 

From the properties of the binomial distribution, we know that E(Y | P = p) =
np = 20p, so E(Y | P) = 20P. Applying the Law of Total Expectation,


$$E(Y) = E[E(Y \mid P)] = E[20P] = \int_0^1 20p \cdot f(p)\,dp = \int_0^1 180p^9\,dp = 18$$


This is reasonable: since E(P)=.9 by integration, the expected proportion of 
good tiles is 90%, and thus the expected number of good tiles in a random sample of 
20 tiles is 18. 

Determining the standard deviation of Y requires the two pieces of the Law of 
Total Variance. First, using the rescaling property of variance, 
Var[E(Y | P)] = Var(20P) = 20² Var(P) = 400 Var(P)


The variance of P can be determined directly from the pdf of P via integration.
The result is Var(P) = 9/1100, so Var[E(Y | P)] = 400(9/1100) = 36/11. Second, the
binomial variance formula np(1 − p) implies that the conditional variance of
Y given P is Var(Y | P) = 20P(1 − P), so


$$E[\mathrm{Var}(Y \mid P)] = E[20P(1 - P)] = \int_0^1 20p(1 - p) \cdot 9p^8\,dp = \frac{18}{11}$$


Therefore, by the Law of Total Variance,


$$\mathrm{Var}(Y) = \mathrm{Var}[E(Y \mid P)] + E[\mathrm{Var}(Y \mid P)] = \frac{36}{11} + \frac{18}{11} = \frac{54}{11} = 4.909,$$

and the standard deviation of Y is σ_Y = √4.909 = 2.22. This “total” standard
deviation accounts for two effects: day-to-day variation in quality as modeled
by P (the first term in the variance expression), and random variation in the number
of observed good tiles as modeled by the binomial distribution (the second term).
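A hierarchical model like this one is also easy to simulate, which gives an independent check on the two laws. The sketch below is one possible approach, not the text's own code: it draws P by the inverse-cdf method (the cdf of P is F(p) = p^9, so P = U^(1/9) for U uniform on [0, 1)), then draws Y as a count of 20 Bernoulli(P) trials. The function name, the number of simulated days, and the seed are arbitrary choices.

```python
import random


def simulate_tiles(days=100_000, n=20, seed=1):
    """Simulate daily counts Y, where P has pdf 9p^8 and Y | P = p ~ Bin(n, p)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(days):
        p = rng.random() ** (1 / 9)                   # inverse-cdf draw of P
        y = sum(rng.random() < p for _ in range(n))   # binomial count given P = p
        counts.append(y)
    mean = sum(counts) / days
    var = sum((y - mean) ** 2 for y in counts) / (days - 1)
    return mean, var


mean, var = simulate_tiles()
print(round(mean, 2), round(var, 2))
```

The simulated mean and variance should land close to 18 and 54/11 ≈ 4.909, up to simulation error.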


Here is an example where the Laws of Total Expectation and Variance are 
helpful in finding the mean and variance of a random variable that is neither discrete 
nor continuous. 


Example 4.32 The probability of a claim being filed on an insurance policy is .1, 
and only one claim can be filed. If a claim is filed, the amount is exponentially 
distributed with mean $1,000. Recall from Sect. 3.4 that the mean and standard 
deviation of the exponential distribution are the same, so the variance is the 
square of this value. We want to find the mean and variance of the amount paid. 
Let X be the number of claims (0 or 1) and let Y be the payment. We know that 
E(Y | X = 0) = 0 and E(Y | X = 1) = 1000. Also, Var(Y | X = 0) = 0 and Var(Y | X = 1) =
1000² = 1,000,000. Here is a table for both the distribution of E(Y | X = x) and
that of Var(Y | X = x):


  x | P(X = x) | E(Y | X = x) | Var(Y | X = x)
  0 |    .9    |      0       |       0
  1 |    .1    |    1000      |   1,000,000


Therefore 
E(Y) = E[E(Y |X)] = E(Y |X = 0)P(X = 0) + E(Y |X = 1)P(X = 1) 
= 0(.9) + 1000(.1) = 100 


The average claim amount across all customers is $100. Next, the variance of the 
conditional mean is 
Var[E(Y | X)] = (0 − 100)²(.9) + (1000 − 100)²(.1) = 90,000,
and the expected value of the conditional variance is
E[Var(Y | X)] = 0(.9) + 1,000,000(.1) = 100,000
Apply the Law of Total Variance to get Var(Y):
Var(Y) = Var[E(Y | X)] + E[Var(Y | X)] = 90,000 + 100,000 = 190,000


Taking the square root gives the standard deviation, σ_Y = $435.89.

Suppose that we want to compute the mean and variance of Y directly. Notice
that X is discrete, but the conditional distribution of Y given X = 1 is continuous.
The random variable Y itself is neither discrete nor continuous, because it has
probability .9 of being 0, but the other .1 of its probability is spread out from 0 to
∞. Such “mixed” distributions may require a little extra effort to evaluate means
and variances, although it is not especially hard in this case (because the discrete
mass is at 0 and doesn’t contribute to expectations).


$$E(Y) = (.1)\int_0^{\infty} y \cdot \frac{1}{1000}e^{-y/1000}\,dy = (.1)(1000) = 100$$


$$E(Y^2) = (.1)\int_0^{\infty} y^2 \cdot \frac{1}{1000}e^{-y/1000}\,dy = (.1)(2)(1000^2) = 200{,}000$$


$$\mathrm{Var}(Y) = E(Y^2) - [E(Y)]^2 = 200{,}000 - 10{,}000 = 190{,}000$$


These agree with what we found using the theorems.
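A short simulation gives a third check on the mixed distribution. This is a sketch under the example's stated assumptions (claim probability .1, exponential claim size with mean $1,000); the function name, number of simulated policies, and seed are our own choices. Note that random.expovariate takes the rate, i.e., 1/mean.

```python
import random


def simulate_payments(policies=200_000, seed=2):
    """Simulate payments: 0 with probability .9, else exponential with mean 1000."""
    rng = random.Random(seed)
    payments = [
        rng.expovariate(1 / 1000) if rng.random() < 0.1 else 0.0
        for _ in range(policies)
    ]
    mean = sum(payments) / policies
    var = sum((y - mean) ** 2 for y in payments) / (policies - 1)
    return mean, var


mean, var = simulate_payments()
print(round(mean, 1), round(var))
```

Because the exponential tail is long, the simulated variance fluctuates noticeably from run to run, but it should be in the neighborhood of 190,000, with the mean near 100.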


4.4.4 Exercises: Section 4.4 (66-84) 


66. Refer back to Exercise 1 of this chapter.
(a) Given that X = 1, determine the conditional pmf of Y, that is, p_{Y|X}(0 | 1),
p_{Y|X}(1 | 1), and p_{Y|X}(2 | 1).
(b) Given that two hoses are in use at the self-service island, what is the
conditional pmf of the number of hoses in use on the full-service island?
(c) Use the result of part (b) to calculate the conditional probability
P(Y < 1 | X = 2).
(d) Given that two hoses are in use at the full-service island, what is the
conditional pmf of the number in use at the self-service island?
67. A system consists of two components. Suppose the joint pdf of the lifetimes of
the two components in a system is given by f(x, y) = c[10 − (x + y)] for x > 0,
y > 0, x + y < 10, where x and y are in months.
(a) If the first component functions for exactly 3 months, what is the
probability that the second functions for more than 2 months?
(b) Suppose the system will continue to work only as long as both
components function. Among 20 of these systems that operate
independently of each other, what is the probability that at least half work for
more than 3 months?

68. The joint pdf of pressures for right and left front tires is given in Exercise 11.
(a) Determine the conditional pdf of Y given that X = x and the conditional
pdf of X given that Y = y.
(b) If the pressure in the right tire is found to be 22 psi, what is the probability
that the left tire has a pressure of at least 25 psi? Compare this to
P(Y ≥ 25).
(c) If the pressure in the right tire is found to be 22 psi, what is the expected
pressure in the left tire, and what is the standard deviation of pressure in
this tire?

69. Suppose that X is uniformly distributed between 0 and 1. Given X = x, Y is
uniformly distributed between 0 and x².
(a) Determine E(Y | X = x) and then Var(Y | X = x).
(b) Determine f(x, y) using f_X(x) and f_{Y|X}(y | x).
(c) Determine f_Y(y).

70. Consider three Ping-Pong balls numbered 1, 2, and 3. Two balls are randomly
selected with replacement. If the sum of the two resulting numbers exceeds
4, two balls are again selected. This process continues until the sum is at most
4. Let X and Y denote the last two numbers selected. Possible (X, Y) pairs are
{(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1)}.
(a) Determine p_{X,Y}(x, y).
(b) Determine p_{Y|X}(y | x).
(c) Determine E(Y | X = x).
(d) Determine E(X | Y = y). What special property of p(x, y) allows us to get
this from (c)?
(e) Determine Var(Y | X = x).

71. Let X be a random digit (0, 1, 2, ..., 9 are equally likely) and let Y be a random
digit not equal to X. That is, the nine digits other than X are equally likely
for Y.
(a) Determine p_X(x), p_{Y|X}(y | x), p_{X,Y}(x, y).
(b) Determine a formula for E(Y | X = x).

72. A pizza delivery business has two phones. On each phone the waiting time
until the first call is exponentially distributed with mean 1 min. Each phone is
not influenced by the other. Let X be the shorter of the two waiting times and
let Y be the longer. Using techniques from Sect. 4.9, it can be shown that the
joint pdf of X and Y is f(x, y) = 2e^{−(x+y)} for 0 < x < y < ∞.
(a) Determine the marginal density of X.
(b) Determine the conditional density of Y given X = x.
(c) Determine the probability that Y is greater than 2, given that X = 1.
(d) Are X and Y independent? Explain.
(e) Determine the conditional mean of Y given X = x.
(f) Determine the conditional variance of Y given X = x.

73. Teresa and Allison each have arrival times uniformly distributed between
12:00 and 1:00. Their times do not influence each other. If Y is the first of the
two times and X is the second, on a scale of 0 to 1, it can be shown that the
joint pdf of X and Y is f(x, y) = 2 for 0 < y < x < 1.
(a) Determine the marginal density of X.
(b) Determine the conditional density of Y given X = x.
(c) Determine the conditional probability that Y is between 0 and .3, given
that X is .5.
(d) Are X and Y independent? Explain.
(e) Determine the conditional mean of Y given X = x.
(f) Determine the conditional variance of Y given X = x.

74. Refer back to the previous exercise.
(a) Determine the marginal density of Y.
(b) Determine the conditional density of X given Y = y.
(c) Determine the conditional mean of X given Y = y.
(d) Determine the conditional variance of X given Y = y.

75. According to an article in the August 30, 2002 issue of the Chronicle of
Higher Education, 30% of first-year college students are liberals, 20% are
conservatives, and 50% characterize themselves as middle-of-the-road.
Choose two students at random, let X be the number of liberals among the
two, and let Y be the number of conservatives among the two.
(a) Using the multinomial distribution from Sect. 4.1, give the joint probability
mass function p(x, y) of X and Y and the corresponding joint probability table.
(b) Determine the marginal probability mass functions by summing p(x, y)
numerically. How could these be obtained directly? [Hint: What are the
univariate distributions of X and Y?]
(c) Determine the conditional probability mass function of Y given X = x for
x = 0, 1, 2. Compare this to the binomial distribution with n = 2 − x and
p = .2/(.2 + .5). Why should this work?
(d) Are X and Y independent? Explain.
(e) Find E(Y | X = x) for x = 0, 1, 2. Do this numerically and then compare with
the use of the formula for the binomial mean, using the binomial
distribution given in part (c).
(f) Determine Var(Y | X = x) for x = 0, 1, 2. Do this numerically and then
compare with the use of the formula for the binomial variance, using the
binomial distribution given in part (c).

76. A class has 10 mathematics majors, 6 computer science majors, and 4 statistics
majors. Two of these students are randomly selected to make a presentation.
Let X be the number of mathematics majors and let Y be the number of
computer science majors chosen.
(a) Determine the joint probability mass function p(x, y). This generalizes the
hypergeometric distribution studied in Sect. 2.6. Give the joint probability
table showing all nine values, of which three should be 0.
(b) Determine the marginal probability mass functions by summing
numerically. How could these be obtained directly? [Hint: What type of rv is X?
Y?]
(c) Determine the conditional probability mass function of Y given X = x for
x = 0, 1, 2. Compare with the h(y; 2 − x, 6, 10) distribution. Intuitively,
why should this work?
(d) Are X and Y independent? Explain.
(e) Determine E(Y | X = x), x = 0, 1, 2. Do this numerically and then compare
with the use of the formula for the hypergeometric mean, using the
hypergeometric distribution given in part (c).
(f) Determine Var(Y | X = x), x = 0, 1, 2. Do this numerically and then compare
with the use of the formula for the hypergeometric variance, using the
hypergeometric distribution given in part (c).

77. A 1-ft-long stick is broken at a point X (measured from the left end) chosen
randomly uniformly along its length. Then the left part is broken at a point
Y chosen randomly uniformly along its length. In other words, X is uniformly
distributed between 0 and 1 and, given X = x, Y is uniformly distributed
between 0 and x.
(a) Determine E(Y | X = x) and then Var(Y | X = x).
(b) Determine f(x, y) using f_X(x) and f_{Y|X}(y | x).
(c) Determine f_Y(y).
(d) Use f_Y(y) from (c) to get E(Y) and Var(Y).
(e) Use (a) and the Laws of Total Expectation and Variance to get E(Y) and
Var(Y).

78. Consider the situation in Example 4.29, and suppose further that the standard
deviation for fares per car is $4.
(a) Find the variance of the rv E(Y | X).
(b) Using Expression (4.6) from the previous section, the conditional variance
of Y given X = x is 4²x = 16x. Determine the mean of the rv Var(Y | X).
(c) Use the Law of Total Variance to find σ_Y, the unconditional standard
deviation of Y.

79. This week the number X of claims coming into an insurance office has a
Poisson distribution with mean 100. The probability that any particular claim
relates to automobile insurance is .6, independent of any other claim. If Y is
the number of automobile claims, then Y is binomial with X trials, each with
“success” probability .6.
(a) Determine E(Y | X = x) and Var(Y | X = x).
(b) Use part (a) to find E(Y).
(c) Use part (a) to find Var(Y).

80. In the previous exercise, show that the distribution of Y is Poisson with mean
60. [You will need to recognize the Maclaurin series expansion for the
exponential function.] Use the knowledge that Y is Poisson with mean 60 to
find E(Y) and Var(Y).

81. The heights of American men follow a normal distribution with mean 70 in.
and standard deviation 3 in. Suppose that the weight distribution (lbs) for men
that are x inches tall also has a normal distribution, but with mean 4x − 104 and
standard deviation .3x − 17. Let Y denote the weight of a randomly selected
American man. Find the (unconditional) mean and standard deviation of Y.
82. A statistician is waiting behind one person to check out at a store. The
check-out time for the first person, X, can be modeled by an exponential distribution
with some parameter λ > 0. The statistician observes the first person’s
check-out time, x; being a statistician, she surmises that her check-out time Y will
follow an exponential distribution with mean x.
(a) Determine E(Y | X = x) and Var(Y | X = x).
(b) Use the Laws of Total Expectation and Variance to find E(Y) and Var(Y).
(c) Write out the joint pdf of X and Y. [Hint: You have f_X(x) and f_{Y|X}(y | x).]
Then write an integral expression for the marginal pdf of Y (from which, at
least in theory, one could determine the mean and variance of Y). What
happens?
83. In the game Plinko on the television game show The Price is Right,
contestants have the opportunity to earn “chips” (flat, circular disks) that
can be dropped down a peg board into slots labeled with cash amounts.
Every contestant is given one chip automatically and can earn up to four
more chips by correctly guessing the prices of certain small items. If we let
p denote the probability a contestant correctly guesses the price of a prize,
then the number of chips a contestant earns, X, can be modeled as X = 1 + N,
where N ~ Bin(4, p).
(a) Determine E(X) and Var(X).
(b) For each chip, the amount of money won on the Plinko board has the
following distribution:


  Value       |  $0     $100    $500    $1000    $10,000
  Probability |  .39    .03     .11     .24      .23


Determine the mean and variance of the winnings from a single chip.
(c) Let Y denote the total winnings of a randomly selected contestant. Using
results from the previous section, the conditional mean and variance of Y,
given a player gets x chips, are μx and σ²x, respectively, where μ and
σ² are the mean and variance for a single chip computed in (b). Find
expressions for the (unconditional) mean and standard deviation of Y.
[Note: Your answers will be functions of p.]
(d) Evaluate your answers to part (c) for p = 0, .5, and 1. Do these answers
make sense? Explain.

84. Let X and Y be any two random variables, and write μ_{Y|X} for the rv E(Y | X).
(a) Show that E[Var(Y | X)] = E[Y²] − E[μ²_{Y|X}]. [Hint: Use the variance
shortcut formula and apply the Law of Total Expectation to the first term.]
(b) Show that Var(E[Y | X]) = E[μ²_{Y|X}] − (E[Y])². [Hint: Use the variance
shortcut formula again; this time, apply the Law of Total Expectation to the
second term.]
(c) Combine the previous two results to establish the Law of Total Variance.


4.5 Limit Theorems (What Happens as n Gets Large)


Many problems in probability and statistics involve either a sum or an average of 
random variables. In this section we consider what happens as n, the number of 
variables in such sums and averages, gets large. The most important result of this 
type is the celebrated Central Limit Theorem, according to which the approximate 
distribution is normal when n is sufficiently large. 


4.5.1 Random Samples 


The random variables from which our sums and averages will be created must 
satisfy two general conditions. 


DEFINITION 

The rvs X_1, X_2, ..., X_n are said to be independent and identically
distributed (iid) if

1. The X_i's are independent rvs.

2. Every X_i has the same probability distribution.

Such a collection of rvs is also called a (simple) random sample of size n. 


For example, X_1, X_2, ..., X_n might be a random sample from a normal distribution
with mean 100 and standard deviation 15; then the X_i's are independent and each one
has the specified normal distribution. Similarly, for these variables to constitute a
random sample from an exponential distribution, they must be independent and the
value of the exponential parameter λ must be the same for each variable.

The notion of iid rvs is meant to resemble (simple) random sampling from a
population: X_1 is the value of some variable for the first individual or object
selected, X_2 is the value of that same variable for the second selected individual
or object, and so on. If sampling is either with replacement or from a (potentially)
infinite population, Conditions 1 and 2 are satisfied exactly. These conditions will
be approximately satisfied if sampling is without replacement, yet the sample size
n is much smaller than the population size N. In practice, if n/N ≤ .05 (at most 5% of
the population is sampled), we proceed as if the X_i's form a random sample.

Throughout this section, we will be primarily interested in the properties of two
particular rvs derived from random samples: the sample total T and the sample
mean X̄:


$$T = X_1 + \cdots + X_n = \sum_{i=1}^{n} X_i, \qquad \bar{X} = \frac{X_1 + \cdots + X_n}{n} = \frac{T}{n}$$


Note that both T and X̄ are linear combinations of the X_i's.
PROPOSITION
Suppose X_1, X_2, ..., X_n are iid with common mean μ and common standard
deviation σ. T and X̄ have the following properties:


1. E(T) = nμ and E(X̄) = μ.
2. Var(T) = nσ² and SD(T) = √n σ; Var(X̄) = σ²/n and SD(X̄) = σ/√n.
3. If the X_i's are normally distributed, then T and X̄ are also normally
   distributed.


Proof Recall from the main theorem of Sect. 4.3 that the expected value of a sum is
the sum of individual expected values; moreover, when the variables in the sum are
independent, the variance of the sum is the sum of the individual variances:


E(T) = E(X_1 + ⋯ + X_n) = E(X_1) + ⋯ + E(X_n) = μ + ⋯ + μ = nμ
Var(T) = Var(X_1 + ⋯ + X_n) = Var(X_1) + ⋯ + Var(X_n) = σ² + ⋯ + σ² = nσ²
SD(T) = √(nσ²) = √n σ


The corresponding results for X̄ can be derived by writing X̄ = (1/n) · T and using
basic rescaling properties, such as E(cY) = cE(Y). Property 3 is a consequence of
the more general result from Sect. 4.3 that any linear combination of independent
normal rvs is normal.


According to Property 1, the distribution of X̄ is centered precisely at the mean of
the population from which the sample has been selected. If the sample mean is used
to compute an estimate (educated guess) of the population mean μ, there will be no
systematic tendency for the estimate to be too large or too small.

Property 2 shows that the X̄ distribution becomes more concentrated about μ as
the sample size n increases, because its standard deviation decreases. In marked
contrast, the distribution of T becomes more spread out as n increases. Averaging
moves probability in toward the middle, whereas totaling spreads probability out
over a wider and wider range of values. The expression σ/√n for the standard
deviation of X̄ is called the standard error of the mean, and it indicates the typical
amount by which a value of X̄ will deviate from the true mean, μ (in contrast, σ itself
represents the typical difference between an individual X_i and μ).
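The shrinking standard error can be seen in a small simulation. The sketch below is illustrative only; the choice of a uniform population on [0, 1) (for which σ = 1/√12), the sample sizes, the number of replications, and the seed are all our own assumptions.

```python
import random
import statistics


def sd_of_sample_mean(n, reps=20_000, seed=3):
    """Estimate SD of the sample mean from `reps` samples of n uniform(0, 1) draws."""
    rng = random.Random(seed)
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(reps)]
    return statistics.stdev(means)


sigma = (1 / 12) ** 0.5   # population SD of the uniform(0, 1) distribution
for n in (4, 16, 64):
    # Compare the simulated SD of the sample mean with sigma / sqrt(n).
    print(n, round(sd_of_sample_mean(n), 4), round(sigma / n ** 0.5, 4))
```

Each fourfold increase in n should roughly halve the simulated standard deviation, in line with σ/√n.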

When σ is unknown, as is usually the case when μ is unknown and we are trying
to estimate it, we may substitute the sample standard deviation, s, of our sample into
the standard error formula and say that an observed value of X̄ will typically differ
by about s/√n from μ. This is the estimated standard error formula presented in
Sects. 2.8 and 3.8.
Finally, Property 3 says that we know everything there is to know about the X̄ and
T distributions when the population distribution is normal. In particular,
probabilities such as P(a ≤ X̄ ≤ b) and P(c ≤ T ≤ d) can be obtained simply by
standardizing. Figure 4.8 illustrates the X̄ part of the proposition.


Example 4.33 The amount of time that a patient undergoing a particular procedure
spends in a certain outpatient surgery center is a random variable with a mean
value of 4.5 h and a standard deviation of 1.4 h. Let X_1, ..., X_25 be the times for
a random sample of 25 patients. Then the expected total time for the 25 patients is
E(T) = nμ = 25(4.5) = 112.5 h, whereas the expected sample mean amount of time
is E(X̄) = μ = 4.5 h. The standard deviations of T and X̄ are


$$\sigma_T = \sqrt{n}\,\sigma = \sqrt{25}\,(1.4) = 7 \text{ h}$$


$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{1.4}{\sqrt{25}} = .28 \text{ h}$$


Suppose further that such patient times follow a normal distribution, i.e.,
X_i ~ N(4.5, 1.4). Then the total time spent by 25 randomly selected patients in this
center is also normal: T ~ N(112.5, 7). The probability their total time exceeds
5 days (120 h) is


$$P(T > 120) = 1 - P(T \le 120) = 1 - \Phi\left(\frac{120 - 112.5}{7}\right) = 1 - \Phi(1.07) = 1 - .8577 = .1423$$


This same probability can be reframed in terms of X̄: for 25 patients, a total time
of 120 h equates to an average time of 120/25 = 4.8 h, and since X̄ ~ N(4.5, .28),


$$P(\bar{X} > 4.8) = 1 - \Phi\left(\frac{4.8 - 4.5}{.28}\right) = 1 - \Phi(1.07) = 1 - .8577 = .1423$$
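These normal probabilities can be computed with software instead of a table. A minimal sketch using Python's standard-library statistics.NormalDist follows; the variable names are ours. Since Φ(1.07) = .8577, the upper-tail probability 1 − Φ(1.07) is about .142; the code does not round z to two decimals, so its answer differs slightly from a table lookup.

```python
from statistics import NormalDist

mu, sigma, n = 4.5, 1.4, 25

# Sample total T ~ N(n*mu, sqrt(n)*sigma); sample mean ~ N(mu, sigma/sqrt(n)).
total = NormalDist(mu=n * mu, sigma=n ** 0.5 * sigma)
mean = NormalDist(mu=mu, sigma=sigma / n ** 0.5)

p_total = 1 - total.cdf(120)     # P(T > 120)
p_mean = 1 - mean.cdf(120 / n)   # P(sample mean > 4.8), the same event

print(round(p_total, 4), round(p_mean, 4))
```

The two probabilities coincide because T > 120 and X̄ > 4.8 describe the same event.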


Example 4.34 Resistors used in electronics manufacturing are labeled with a
“nominal” resistance as well as a percentage tolerance. For example, a 330-ohm
resistor with a 5% tolerance is anticipated to have an actual resistance between
313.5 Ω and 346.5 Ω. Consider five such resistors, randomly selected from the
population of all resistors with those specifications, and model the resistance of
each by a uniform distribution on [313.5, 346.5]. If these are connected in series, the
resistance R of the system is given by R = X_1 + ⋯ + X_5, where the X_i are the iid
uniform resistances.

A random variable uniformly distributed on [A, B] has mean (A + B)/2 and
standard deviation (B − A)/√12. For our uniform model, the mean resistance is
E(X_i) = (313.5 + 346.5)/2 = 330 Ω, the nominal resistance, with a standard
deviation of (346.5 − 313.5)/√12 = 9.526 Ω. The system’s resistance has mean and
standard deviation


E(R) = nμ = 5(330) = 1650 Ω,    SD(R) = √n σ = √5 (9.526) = 21.3 Ω


But what is the probability distribution of R? Is R also uniformly distributed?
Determining the exact pdf of R is difficult (it requires four convolutions). And the
mgf of R, while easy to obtain, is not recognizable as coming from any particular
family of known distributions. Instead, we resort to a simulation of R, the results of
which appear in Fig. 4.9. For 10,000 iterations in R (appropriately), five
independent uniform variates on [313.5, 346.5] were created and summed; see Sect. 3.8 for
information on simulating a uniform distribution. The histogram in Fig. 4.9 clearly
indicates that R is not uniform; in fact, if anything, R appears (from the simulation,
anyway) to be approximately normal!
Fig. 4.9 Simulated distribution of the random variable R in Example 4.34
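A simulation along the lines described in Example 4.34 might look like the sketch below. The text's simulation was run in the R software; this Python analogue is our own illustration, with a hypothetical function name and an arbitrary seed. It sums five independent uniform variates on [313.5, 346.5] in each of 10,000 iterations.

```python
import random
import statistics


def simulate_series_resistance(iterations=10_000, seed=4):
    """Simulate R = X_1 + ... + X_5 with X_i iid uniform on [313.5, 346.5]."""
    rng = random.Random(seed)
    return [
        sum(rng.uniform(313.5, 346.5) for _ in range(5))
        for _ in range(iterations)
    ]


r = simulate_series_resistance()
print(round(statistics.mean(r), 1), round(statistics.stdev(r), 2))
```

The simulated mean and standard deviation should be close to 1650 Ω and 21.3 Ω, and a histogram of the values in r would show the bell shape described above.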


4.5.2 The Central Limit Theorem


When iid X_i's are normally distributed, so are T and X̄ for every sample size n. The
simulation results from Example 4.34 suggest that even when the population
distribution is not normal, summing (or averaging) produces a distribution more
bell-shaped than the one being sampled. Upon reflection, this is quite intuitive: in
order for R to be near 5(346.5) = 1732.5, its theoretical maximum, all five randomly
selected resistors would have to exert resistances at the high end of their common
range (i.e., every X_i would have to be near 346.5). Thus, R-values near 1732.5 are
unlikely, and the same applies to R’s theoretical minimum of 5(313.5) = 1567.5. On
the other hand, there are many ways for R to be near the mean value of 1650: all five
resistances in the middle, two low and one middle and two high, and so on. Thus,
R is more likely to be “centrally” located than out at the extremes. (This is
analogous to the well-known fact that rolling a pair of dice is far more likely to
result in a sum of 7 than 2 or 12, because there are more ways to obtain 7.)

This general pattern of behavior for sample totals and sample means is
formalized by the most important theorem of probability, the Central Limit
Theorem (CLT). A proof of this theorem is beyond the scope of this book, but interested
readers may consult the text by Devore and Berk listed in the references.


CENTRAL LIMIT THEOREM

Let X_1, X_2, ..., X_n be a random sample from a distribution with mean μ and
standard deviation σ. Then, in the limit as n → ∞, the standardized versions
of T and X̄ have the standard normal distribution. That is,


$$\lim_{n \to \infty} P\left(\frac{T - n\mu}{\sqrt{n}\,\sigma} \le z\right) = P(Z \le z) = \Phi(z)$$


and


$$\lim_{n \to \infty} P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z\right) = P(Z \le z) = \Phi(z)$$


where Z is a standard normal rv. It is customary to say that T and X̄ are
asymptotically normal. Thus when n is sufficiently large, the sample total
T has approximately a normal distribution with mean μ_T = nμ and standard
deviation σ_T = √n σ. Equivalently, for large n the sample mean X̄ has
approximately a normal distribution with mean μ_X̄ = μ and standard deviation
σ_X̄ = σ/√n.


Figure 4.10 illustrates the Central Limit Theorem for $\bar{X}$. According to the CLT,
when $n$ is large and we wish to calculate a probability such as $P(a \le \bar{X} \le b)$ or
$P(c \le T \le d)$, we need only “pretend” that $\bar{X}$ or $T$ is normal, standardize it, and use
software or the standard normal table. The resulting answer will be approximately
correct. The exact answer could be obtained only by first finding the distribution of
$T$ or $\bar{X}$, so the CLT provides a truly impressive shortcut.

Fig. 4.10 Illustration of the Central Limit Theorem for $\bar{X}$

A practical difficulty in applying the CLT is in knowing when n is “sufficiently 
large.” The problem is that the accuracy of the approximation for a particular 
n depends on the shape of the original underlying distribution being sampled. If 
the underlying distribution is symmetric and there is not much probability far out in 
the tails, then the approximation will be good even for a small n, whereas if it is 
highly skewed or has “heavy” tails, then a large n will be required. For example, if 
the distribution is uniform on an interval, then it is symmetric with no probability in 
the tails, and the normal approximation is very good for n as small as
10 (in Example 4.34, even for n = 5, the distribution of the sample total appeared
rather bell-shaped). However, at the other extreme, a distribution can have such fat
tails that its mean fails to exist and the Central Limit Theorem does not apply, so no
n is big enough. A popular, although frequently somewhat conservative, convention
is that the Central Limit Theorem may be safely applied when n > 30. Of course, 
there are exceptions, but this rule applies to most distributions of real data. 
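
The dependence of "sufficiently large" on the shape of the population can be explored by simulation. The following is a minimal sketch, not taken from the text: it assumes a uniform(0, 1) population and an exponential population with rate 1 as stand-ins for a symmetric and a skewed distribution, and the function name `tail_fractions`, the cutoff 1.645, and the replication count are illustrative choices. For each case it standardizes simulated sample means and reports how often they fall beyond ±1.645, where a standard normal rv places probability .05 in each tail.

```python
import random
from math import sqrt

def tail_fractions(draw, mu, sigma, n, reps=20000, z=1.645, seed=1):
    """Simulate standardized sample means; return the fractions below -z and above +z."""
    rng = random.Random(seed)
    lower = upper = 0
    for _ in range(reps):
        xbar = sum(draw(rng) for _ in range(n)) / n
        zscore = (xbar - mu) / (sigma / sqrt(n))
        if zscore < -z:
            lower += 1
        elif zscore > z:
            upper += 1
    return lower / reps, upper / reps

# (sampler, population mean, population sd) for two illustrative populations
populations = {
    "uniform(0,1)": (lambda rng: rng.random(), 0.5, sqrt(1 / 12)),
    "exponential(1)": (lambda rng: rng.expovariate(1.0), 1.0, 1.0),
}

for name, (draw, mu, sigma) in populations.items():
    for n in (5, 30):
        lo, hi = tail_fractions(draw, mu, sigma, n)
        print(f"{name:15s} n={n:2d}  lower tail {lo:.3f}  upper tail {hi:.3f}  (normal: 0.050 each)")
```

Under these assumptions the uniform case should already sit near .05 in both tails at n = 5, while the exponential case should show a visibly lopsided pair of tails at n = 5 that evens out as n grows.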


Example 4.35 When a batch of a certain chemical product is prepared, the amount 
of a particular impurity in the batch is a random variable with mean value 4.0 g and 
standard deviation 1.5 g. If 50 batches are independently prepared, what is the 
(approximate) probability that the total amount of impurity is between 175 and 
190 g? According to the convention mentioned above, n = 50 is large enough for
the CLT to be applicable. The total $T$ then has approximately a normal distribution
with mean value $\mu_T = 50(4.0) = 200$ g and standard deviation
$\sigma_T = \sqrt{50}(1.5) = 10.6066$ g. So, with $Z$ denoting a standard normal rv,

$$P(175 \le T \le 190) \approx P\left(\frac{175 - 200}{10.6066} \le Z \le \frac{190 - 200}{10.6066}\right) = \Phi(-.94) - \Phi(-2.36) = .1645$$

Notice that nothing was said initially about the shape of the underlying impurity
distribution. It could be normally distributed, or uniform, or positively skewed;
regardless, the CLT ensures that the distribution of their total, T, is approximately
normal. ■
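
The calculation in Example 4.35 can also be carried out with software instead of the standard normal table. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `clt_total_prob` is an illustrative assumption, not something from the text.

```python
from math import sqrt
from statistics import NormalDist

def clt_total_prob(mu, sigma, n, lo, hi):
    """Approximate P(lo <= T <= hi) for the total T of n iid observations, via the CLT."""
    total = NormalDist(mu=n * mu, sigma=sqrt(n) * sigma)  # T is approx N(n*mu, sqrt(n)*sigma)
    return total.cdf(hi) - total.cdf(lo)

p = clt_total_prob(mu=4.0, sigma=1.5, n=50, lo=175, hi=190)
print(f"P(175 <= T <= 190) is approximately {p:.4f}")
```

Because the z-values are not rounded to two decimals here, the result can differ from the table-based answer .1645 in the third decimal place.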

Example 4.36 Suppose the number of times a randomly selected customer of a
large bank uses the bank’s ATM during a particular period is a random variable 
with a mean value of 3.2 and a standard deviation of 2.4. Among 100 randomly 
selected customers, how likely is it that the sample mean number of times the 
bank’s ATM is used exceeds 4? Let $X_i$ denote the number of times the ith customer
in the sample uses the bank’s ATM. Notice that $X_i$ is a discrete rv, but the CLT is not
limited to continuous random variables. Also, although the fact that the standard
deviation of this nonnegative variable is quite large relative to the mean value
suggests that its distribution is positively skewed, the large sample size implies that
$\bar{X}$ does have approximately a normal distribution. Using $\mu_{\bar{X}} = \mu = 3.2$ and
$\sigma_{\bar{X}} = \sigma/\sqrt{n} = 2.4/\sqrt{100} = .24$,

$$P(\bar{X} > 4) \approx P\left(Z > \frac{4 - 3.2}{.24}\right) = 1 - \Phi(3.33) = .0004$$

■


Example 4.37 Consider the distribution shown in Fig. 4.11 for the amount pur- 
chased (rounded to the nearest dollar) by a randomly selected customer at a 
particular gas station (a similar distribution for purchases in Britain (in £) appeared 
in the article “Data Mining for Fun and Profit,” Statistical Science, 2000: 
111—131; there were big spikes at the values 10, 15, 20, 25, and 30). The 
distribution is obviously quite non-normal. 

We asked Matlab to select 1000 different samples, each consisting of n= 15 
observations, and calculate the value of the sample mean for each one. Figure 4.12 
is a histogram of the resulting 1000 values; this is the approximate distribution of $\bar{X}$
under the specified circumstances. This distribution is clearly approximately
normal even though the sample size is not all that large. As further evidence for
normality, Fig. 4.13 shows a normal probability plot of the 1000 $\bar{x}$ values; the linear
pattern is very prominent. It is typically not non-normality in the central part of the
population distribution that causes the CLT to fail, but instead very substantial
skewness or heavy tails. ■

Fig. 4.11 Probability distribution of X = amount of gasoline purchased ($) in Example 4.37

Fig. 4.12 Approximate sampling distribution of the sample mean amount purchased when n = 15
and the population distribution is as shown in Fig. 4.11

Fig. 4.13 Normal probability plot from Matlab of the 1000 $\bar{x}$ values based on samples of size n = 15

The CLT can also be generalized so it applies to non-identically distributed
independent random variables and certain linear combinations. Roughly speaking, 
if n is large and no individual term is likely to contribute too much to the overall 
value, then asymptotic normality prevails (see Exercise 190). It can also be 
generalized to sums of variables which are not independent provided the extent 
of dependence between most pairs of variables is not too strong. 


4.5.3 Other Applications of the Central Limit Theorem 


The CLT can be used to justify the normal approximation to the binomial distribu- 
tion discussed in Sect. 3.3. Recall that a binomial variable X is the number of 
successes in a binomial experiment consisting of $n$ independent success/failure
trials with $p = P(\text{success})$ for any particular trial. Define new rvs $X_1, X_2, \ldots, X_n$ by

$$X_i = \begin{cases} 1 & \text{if the } i\text{th trial results in a success} \\ 0 & \text{if the } i\text{th trial results in a failure} \end{cases} \qquad (i = 1, \ldots, n)$$

Because the trials are independent and P(success) is constant from trial to trial,
the $X_i$s are iid (a random sample from a Bernoulli distribution). When the $X_i$s
are summed, a 1 is added for every success that occurs and a 0 for every failure, so
$X = X_1 + \cdots + X_n$, their total. The sample mean of the $X_i$s is $\bar{X} = X/n$, the sample
proportion of successes, which in previous discussions we have denoted $\hat{P}$.
The Central Limit Theorem then implies that if $n$ is sufficiently large, both $X$ and
$\hat{P}$ are approximately normal. We summarize properties of the $\hat{P}$
distribution in the following corollary; Statements 1 and 2 were derived in Sect. 2.4.


COROLLARY 

Consider an event $A$ in the sample space of some experiment with $p = P(A)$.
Let $X$ = the number of times $A$ occurs when the experiment is repeated
$n$ independent times, and define

$$\hat{P} = \frac{X}{n}$$

Then

1. $\mu_{\hat{P}} = E(\hat{P}) = p$

2. $\sigma_{\hat{P}} = SD(\hat{P}) = \sqrt{\dfrac{p(1 - p)}{n}}$

3. As $n$ increases, the distribution of $\hat{P}$ approaches a normal distribution.
In practice, Property 3 is taken to say that $\hat{P}$ is approximately normal,
provided that $np \ge 10$ and $n(1 - p) \ge 10$.


The necessary sample size for this approximation depends on the value of $p$:
when $p$ is close to .5, the distribution of each $X_i$ is reasonably symmetric (see
Fig. 4.14), whereas the distribution is quite skewed when $p$ is near 0 or 1. Using the
approximation only if both $np \ge 10$ and $n(1 - p) \ge 10$ ensures that $n$ is large enough
to overcome any skewness in the underlying Bernoulli distribution.

Fig. 4.14 Two Bernoulli distributions: (a) p = .4 (reasonably symmetric); (b)

Example 4.38 A computer simulation in the style of Sect. 1.6 is used to determine 
the probability that a complex system of components operates properly throughout 
the warranty period. Unknown to the investigator, the true probability is P(A) = .18. 
If 10,000 simulations of the underlying process are run, what is the chance the 
estimated probability $\hat{P}(A)$ will lie within .01 of the true probability $P(A)$?

Apply the preceding corollary, with $n = 10{,}000$ and $p = P(A) = .18$. The
expected value of the estimator $\hat{P}(A)$ is $p = .18$, and the standard deviation is
$\sigma_{\hat{P}} = \sqrt{.18(.82)/10{,}000} = .00384$. Since $np = 1800 \ge 10$ and $n(1 - p) = 8200 \ge 10$, a
normal distribution can safely be used to approximate the distribution of $\hat{P}(A)$. This
sample proportion is within .01 of the true probability, .18, iff $.17 < \hat{P}(A) < .19$, so
the desired likelihood is approximately

$$P(.17 < \hat{P} < .19) \approx P\left(\frac{.17 - .18}{.00384} < Z < \frac{.19 - .18}{.00384}\right) = \Phi(2.60) - \Phi(-2.60) = .9906$$

■
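
Since $X = n\hat{P}$ is exactly binomial in Example 4.38, the quality of the normal approximation can be checked directly. The sketch below is an illustration under stated assumptions: the helper `binom_pmf` is hypothetical, and it evaluates the binomial pmf in log space (via `math.lgamma`) only because the factorials involved are far too large for ordinary floating-point arithmetic.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Binomial pmf b(k; n, p), computed in log space to avoid overflow/underflow."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

n, p, margin = 10_000, 0.18, 0.01

# Normal approximation from the corollary: P-hat is approx N(p, sqrt(p(1-p)/n))
sd = sqrt(p * (1 - p) / n)
phat = NormalDist(mu=p, sigma=sd)
approx = phat.cdf(p + margin) - phat.cdf(p - margin)

# Exact value: .17 < X/n < .19 is the event 1701 <= X <= 1899
exact = sum(binom_pmf(k, n, p) for k in range(1701, 1900))

print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```

The two values should agree to roughly three decimal places, which is consistent with the conditions $np \ge 10$ and $n(1 - p) \ge 10$ being met by a wide margin.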


The normal distribution serves as a reasonable approximation to the binomial 
pmf when n is large because the binomial distribution is additive, i.e., a binomial rv 
can be expressed as the sum of other, iid rvs. Other additive distributions include the
Poisson, negative binomial, gamma, and (of course) normal distributions; some of
these were discussed at the end of Sect. 4.3. In particular, the CLT justifies normal
approximations to the following distributions:
• Poisson, when $\mu$ is large
• Negative binomial, when $r$ is large
• Gamma, when $\alpha$ is large

As a final application of the CLT, first recall from Sect. 3.5 that $X$ has a
lognormal distribution if $\ln(X)$ has a normal distribution.


PROPOSITION

Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution for which only
positive values are possible [$P(X_i > 0) = 1$]. Then if $n$ is sufficiently large, the
product $Y = X_1 X_2 \cdots X_n$ has approximately a lognormal distribution; that is,
$\ln(Y)$ has approximately a normal distribution.


To verify this, note that

$$\ln(Y) = \ln(X_1) + \ln(X_2) + \cdots + \ln(X_n)$$

Since $\ln(Y)$ is a sum of independent and identically distributed rvs [the $\ln(X_i)$s], it is
approximately normal when $n$ is large, so $Y$ itself has approximately a lognormal
distribution. As an example of the applicability of this result, it has been argued that
the damage process in plastic flow and crack propagation is a multiplicative
process, so that variables such as percentage elongation and rupture strength have
approximately lognormal distributions.
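
The proposition can be illustrated with a short simulation. This is a sketch under assumptions not made in the text: the factors are taken to be uniform on (0.5, 1.5), the product has n = 50 factors, and sample skewness is used as a rough, informal gauge of symmetry. If the proposition holds, $\ln(Y)$ should look roughly symmetric while $Y$ itself should be strongly right-skewed.

```python
import random
from math import exp, log
from statistics import mean, median, stdev

def skewness(data):
    """Sample skewness: average cubed z-score (near 0 for a symmetric distribution)."""
    m, s = mean(data), stdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)

rng = random.Random(7)
n, reps = 50, 10000  # assumed number of factors and number of simulated products

# ln(Y) = ln(X_1) + ... + ln(X_n), with X_i ~ uniform(0.5, 1.5) as an illustrative choice
log_products = [sum(log(rng.uniform(0.5, 1.5)) for _ in range(n)) for _ in range(reps)]
products = [exp(v) for v in log_products]

print(f"skewness of ln(Y): {skewness(log_products):.3f}")
print(f"mean of Y: {mean(products):.3f}   median of Y: {median(products):.3f}")
```

A mean of $Y$ well above its median, together with near-zero skewness for $\ln(Y)$, is the pattern one expects from a lognormal variable.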


4.5.4 The Law of Large Numbers 


In the simulation sections of Chaps. 1-3, we described how a sample proportion $\hat{P}$
could estimate a true probability $p$, and a sample mean $\bar{X}$ served to approximate a
theoretical expected value $\mu$. Moreover, in both cases the precision of the estimation
improves as the number of simulation runs, $n$, increases. We would like to be able to
say that our estimates “converge” to the correct values in some sense. Such a
convergence statement is justified by another important theoretical result, called
the Law of Large Numbers.

To begin, recall the first proposition in this section: If $X_1, X_2, \ldots, X_n$ is a random
sample from a distribution with mean $\mu$ and standard deviation $\sigma$, then $E(\bar{X}) = \mu$
and $\mathrm{Var}(\bar{X}) = \sigma^2/n$. As $n$ increases, the expected value of $\bar{X}$ remains at $\mu$ but the
variance approaches zero; that is, $E[(\bar{X} - \mu)^2] = \mathrm{Var}(\bar{X}) = \sigma^2/n \to 0$. We say that
$\bar{X}$ converges in mean square to $\mu$ because the mean of the squared difference
between $\bar{X}$ and $\mu$ goes to zero. This is one form of the Law of Large Numbers.

Another form of convergence states that as the sample size $n$ increases, $\bar{X}$ is
increasingly unlikely to differ by any set amount from $\mu$. More precisely, let $\varepsilon$ be a
positive number close to 0, such as .01 or .001, and consider $P(|\bar{X} - \mu| \ge \varepsilon)$, the
probability that $\bar{X}$ differs from $\mu$ by at least $\varepsilon$ (at least .01, at least .001, etc.). We will
prove shortly with the help of Chebyshev’s inequality that, no matter how small the
value of $\varepsilon$, this probability will approach zero as $n \to \infty$. Because of this,
statisticians say that $\bar{X}$ converges to $\mu$ in probability.

The two forms of the Law of Large Numbers are summarized in the following 
theorem. 


LAW OF LARGE NUMBERS 
If $X_1, X_2, \ldots, X_n$ is a random sample from a distribution with mean $\mu$, then $\bar{X}$
converges to $\mu$


1. In mean square: $E[(\bar{X} - \mu)^2] \to 0$ as $n \to \infty$
2. In probability: $P(|\bar{X} - \mu| \ge \varepsilon) \to 0$ as $n \to \infty$ for any $\varepsilon > 0$


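Proof The proof of Statement 1 appears a few paragraphs above. For Statement
2, recall Chebyshev’s inequality, which states that for any rv $Y$,
$P(|Y - \mu_Y| \ge k\sigma_Y) \le 1/k^2$ for any $k \ge 1$ (i.e., the probability that $Y$ is at least
$k$ standard deviations away from its mean is at most $1/k^2$). Let $Y = \bar{X}$, so
$\mu_Y = E(\bar{X}) = \mu$ and $\sigma_Y = SD(\bar{X}) = \sigma/\sqrt{n}$. Now, for any $\varepsilon > 0$, determine the
value of $k$ such that $\varepsilon = k\sigma_Y = k\sigma/\sqrt{n}$. Solving for $k$ yields $k = \varepsilon\sqrt{n}/\sigma$, which for
sufficiently large $n$ will exceed 1. Apply Chebyshev’s inequality:

$$P(|Y - \mu_Y| \ge k\sigma_Y) \le \frac{1}{k^2} \;\Rightarrow\; P\left(|\bar{X} - \mu| \ge \frac{\varepsilon\sqrt{n}}{\sigma} \cdot \frac{\sigma}{\sqrt{n}}\right) \le \frac{1}{(\varepsilon\sqrt{n}/\sigma)^2}$$

$$\Rightarrow\; P(|\bar{X} - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2 n} \to 0 \text{ as } n \to \infty$$

That is, $P(|\bar{X} - \mu| \ge \varepsilon) \to 0$ as $n \to \infty$ for any $\varepsilon > 0$. ■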


Convergence of $\bar{X}$ to $\mu$ in probability actually holds even if the variance $\sigma^2$ does
not exist (a heavy-tailed distribution) as long as $\mu$ is finite. But then Chebyshev’s
inequality cannot be used, and the proof is much more complicated.
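
Convergence in probability, and the Chebyshev bound $\sigma^2/(\varepsilon^2 n)$ used to prove it, can be examined numerically. The sketch below is illustrative only: it assumes an exponential population with $\mu = \sigma = 1$ and the tolerance $\varepsilon = .25$, and the function name `deviation_prob` is hypothetical.

```python
import random

def deviation_prob(n, eps, reps=2000, seed=3):
    """Estimate P(|Xbar - mu| >= eps) for samples of size n from an exponential(1) population."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if abs(xbar - 1.0) >= eps:  # mu = 1 for this population
            hits += 1
    return hits / reps

eps = 0.25
for n in (10, 100, 1000):
    estimate = deviation_prob(n, eps)
    bound = min(1.0, 1.0 / (n * eps ** 2))  # Chebyshev: sigma^2 / (eps^2 n), with sigma = 1
    print(f"n={n:4d}  estimated probability {estimate:.4f}  Chebyshev bound {bound:.4f}")
```

The estimated probabilities should shrink toward zero as n grows and stay below the bound; the bound is typically far from tight, which is why it serves as a proof device rather than as an approximation.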

An analogous result holds for proportions. If the $X_i$ are iid Bernoulli($p$) rvs,
then similar to the discussion earlier in this section we may write $\bar{X}$ as $\hat{P}$, and
$\mu = E(X_i) = p$. It follows that the sample proportion $\hat{P}$ converges to the “true”
proportion $p$

1. In mean square: $E[(\hat{P} - p)^2] \to 0$ as $n \to \infty$, and

2. In probability: $P(|\hat{P} - p| \ge \varepsilon) \to 0$ as $n \to \infty$ for any $\varepsilon > 0$.

In statistical language, the Law of Large Numbers states that $\bar{X}$ is a consistent
estimator of $\mu$, and $\hat{P}$ is a consistent estimator of $p$. This consistency property also
applies to other estimators. For example, it can be shown that the sample variance
$S^2 = \sum (X_i - \bar{X})^2/(n - 1)$ converges in probability to the population variance $\sigma^2$.


4.5.5 Exercises: Section 4.5 (85-102) 


85. The inside diameter of a randomly selected piston ring is a random variable 
with mean value 12 cm and standard deviation .04 cm. 

(a) If $\bar{X}$ is the sample mean diameter for a random sample of n = 16 rings,
where is the sampling distribution of $\bar{X}$ centered, and what is the standard
deviation of the $\bar{X}$ distribution?

(b) Answer the questions posed in part (a) for a sample size of n = 64 rings.

(c) For which of the two random samples, the one of part (a) or the one of part
(b), is $\bar{X}$ more likely to be within .01 cm of 12 cm? Explain your reasoning.

86. Refer to the previous exercise. Suppose the distribution of diameter is normal. 
(a) Calculate $P(11.99 \le \bar{X} \le 12.01)$ when n = 16.

(b) How likely is it that the sample mean diameter exceeds 12.01 when
n = 25?

87. Suppose that the fracture angle under pure compression of a randomly 
selected specimen of fiber reinforced polymer-matrix composite material is 
normally distributed with mean value 53 and standard deviation 1 (suggested 
in the article “Stochastic Failure Modelling of Unidirectional Composite Ply 
Failure,” Reliability Engr. and System Safety, 2012: 1-9; this type of material 
is used extensively in the aerospace industry). 

(a) If a random sample of 4 specimens is selected, what is the probability that
the sample mean fracture angle is at most 54? Between 53 and 54? 

(b) How many such specimens would be required to ensure that the first 
probability in (a) is at least .999? 

88. The time taken by a randomly selected applicant for a mortgage to fill out a 
certain form has a normal distribution with mean value 10 min and standard 
deviation 2 min. If five individuals fill out a form on 1 day and six on another,
what is the probability that the sample average amount of time taken on each
day is at most 11 min?

89. The lifetime of a type of battery is normally distributed with mean value 10 h
and standard deviation 1 h. There are four batteries in a package. What
lifetime value is such that the total lifetime of all batteries in a package
exceeds that value for only 5% of all packages?

90. The National Health Statistics Reports dated Oct. 22, 2008 stated that for a
sample size of 277 18-year-old American males, the sample mean waist
circumference was 86.3 cm. A somewhat complicated method was used to
estimate various population percentiles, resulting in the following values:

5th    10th   25th   50th   75th   90th    95th
69.6   70.9   75.2   81.3   95.4   107.1   116.4

(a) Is it plausible that the waist size distribution is at least approximately
normal? Explain your reasoning. If your answer is no, conjecture the
shape of the population distribution.

(b) Suppose that the population mean waist size is 85 cm and that the
population standard deviation is 15 cm. How likely is it that a random
sample of 277 individuals will result in a sample mean waist size of at
least 86.3 cm?

(c) Referring back to (b), suppose now that the population mean waist size is
82 cm (closer to the median than the mean). Now what is the (approxi-
mate) probability that the sample mean will be at least 86.3? In light of
this calculation, do you think that 82 is a reasonable value for $\mu$?

91. A friend commutes by bus to and from work 6 days per week. Suppose that
waiting time is uniformly distributed between 0 and 10 min, and that waiting
times going and returning on various days are independent of each other.
What is the approximate probability that total waiting time for an entire week
is at most 75 min?

92. There are 40 students in an elementary statistics class. On the basis of years of
experience, the instructor knows that the time needed to grade a randomly
chosen paper from the first exam is a random variable with an expected value
of 6 min and a standard deviation of 6 min.

(a) If grading times are independent and the instructor begins grading at
6:50 p.m. and grades continuously, what is the (approximate) probability
that he is through grading before the 11:00 p.m. TV news begins?

(b) If the sports report begins at 11:10, what is the probability that he misses
part of the report if he waits until grading is done before turning on the
TV?

93. The tip percentage at a restaurant has a mean value of 18% and a standard
deviation of 6%.

(a) What is the approximate probability that the sample mean tip percentage
for a random sample of 40 bills is between 16 and 19%?

(b) If the sample size had been 15 rather than 40, could the probability
requested in part (a) be calculated from the given information?

94. A small high school holds its graduation ceremony in the gym. Because of
seating constraints, students are limited to a maximum of four tickets to
graduation for family and friends. The vice principal knows that historically
30% of students want four tickets, 25% want three, 25% want two, 15% want
one, and 5% want none.

(a) Let X = the number of tickets requested by a randomly selected graduating
student, and assume the historical distribution applies to this rv. Find the
mean and standard deviation of X.

(b) Let T = the total number of tickets requested by the 150 students graduating
this year. Assuming all 150 students’ requests are independent, determine
the mean and standard deviation of T.

(c) The gym can seat a maximum of 500 guests. Calculate the (approximate)
probability that all students’ requests can be accommodated. [Hint: Express
this probability in terms of T. What distribution does T have?]

95. Let X represent the amount of gasoline (gallons) purchased by a randomly
selected customer at a gas station. Suppose that the mean and standard devia-
tion of X are 11.5 and 4.0, respectively.

(a) In a sample of 50 randomly selected customers, what is the approximate
probability that the sample mean amount purchased is at least 12 gallons?

(b) In a sample of 50 randomly selected customers, what is the approximate
probability that the total amount of gasoline purchased is at most
600 gallons?

(c) What is the approximate value of the 95th percentile for the total amount
purchased by 50 randomly selected customers?

96. For males the expected pulse rate is 70 per minute and the standard deviation is
10 per minute. For women the expected pulse rate is 77 per minute and the
standard deviation is 12 per minute. Let $\bar{X}$ = the sample average pulse rate for a
random sample of 40 men and let $\bar{Y}$ = the sample average pulse rate for a
random sample of 36 women.

(a) What is the approximate distribution of $\bar{X}$? Of $\bar{Y}$?

(b) What is the approximate distribution of $\bar{X} - \bar{Y}$? Justify your answer.

(c) Calculate (approximately) the probability $P(-2 \le \bar{X} - \bar{Y} \le 1)$.

(d) Calculate (approximately) $P(\bar{X} - \bar{Y} \le -15)$. If you actually observed
$\bar{X} - \bar{Y} \le -15$, would you doubt that $\mu_1 - \mu_2 = -7$? Explain.

97. The first assignment in a statistical computing class involves running a short
program. If past experience indicates that 40% of all students will make no
programming errors, use an appropriate normal approximation to compute the
probability that in a class of 50 students

(a) At least 25 will make no errors.

(b) Between 15 and 25 (inclusive) will make no errors.

98. The number of parking tickets issued in a certain city on any given weekday has
a Poisson distribution with parameter $\mu = 50$. What is the approximate
probability that

(a) Between 35 and 70 tickets are given out on a particular day?

(b) The total number of tickets given out during a 5-day week is between
225 and 275? [For parts (a) and (b), use an appropriate CLT
approximation.]

(c) Use software to obtain the exact probabilities in (a) and (b), and compare
to the approximations.

99. Suppose the distribution of the time X (in hours) spent by students at a certain
university on a particular project is gamma with parameters $\alpha = 50$ and $\beta = 2$.
Use the CLT to compute the (approximate) probability that a randomly selected
student spends at most 125 h on the project.

100. The Central Limit Theorem says that $\bar{X}$ is approximately normal if the sample
size is large. More specifically, the theorem states that the standardized $\bar{X}$ has a
limiting standard normal distribution. That is, $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ has a distri-
bution approaching the standard normal. Can you reconcile this with the Law
of Large Numbers?

101. It can be shown that if $Y_n$ converges in probability to a constant $\tau$, then $h(Y_n)$
converges to $h(\tau)$ for any function $h$ that is continuous at $\tau$. Use this to obtain a
consistent estimator for the rate parameter $\lambda$ of an exponential distribution.
[Hint: How does $\mu$ for an exponential distribution relate to the exponential
parameter $\lambda$?]

102. Let $X_1, \ldots, X_n$ be a random sample from the uniform distribution on $[0, \theta]$. Let
$Y_n$ be the maximum of these observations: $Y_n = \max(X_1, \ldots, X_n)$. Show that $Y_n$
converges in probability to $\theta$, that is, that $P(|Y_n - \theta| \ge \varepsilon) \to 0$ as $n$ approaches
$\infty$. [Hint: We shall show in Sect. 4.9 that the pdf of $Y_n$ is $f(y) = ny^{n-1}/\theta^n$ for
$0 \le y \le \theta$.]


4.6 Transformations of Jointly Distributed Random Variables


In the previous chapter we discussed the problem of starting with a single random
variable $X$, forming some function of $X$, such as $Y = X^2$ or $Y = e^X$, and investigating
the distribution of this new random variable $Y$. We now generalize this scenario by
starting with more than a single random variable. Consider as an example a system
having a component that can be replaced just once before the system itself expires.
Let $X_1$ denote the lifetime of the original component and $X_2$ the lifetime of the
replacement component. Then any of the following functions of $X_1$ and $X_2$ may be
of interest to an investigator:

1. The total lifetime, $X_1 + X_2$.

2. The ratio of lifetimes $X_1/X_2$ (for example, if the value of this ratio is 2, the
original component lasted twice as long as its replacement).

3. The ratio $X_1/(X_1 + X_2)$, which represents the proportion of system lifetime during
which the original component operated.


4.6.1 The Joint Distribution of Two New Random Variables


Given two random variables $X_1$ and $X_2$, consider forming two new random variables
$Y_1 = u_1(X_1, X_2)$ and $Y_2 = u_2(X_1, X_2)$. Our focus is on finding the joint distribution of
these two new variables. Since most applications assume that the $X_i$s are continu-
ous, we restrict ourselves to that case. Some notation is needed before a general
result can be given. Let

$f(x_1, x_2)$ = the joint pdf of the two original variables
$g(y_1, y_2)$ = the joint pdf of the two new variables

The $u_1(\cdot)$ and $u_2(\cdot)$ functions express the new variables in terms of the original
ones. The general result presumes that these functions can be inverted to solve for
the original variables in terms of the new ones:

$$X_1 = v_1(Y_1, Y_2), \qquad X_2 = v_2(Y_1, Y_2)$$

For example, if

$$y_1 = x_1 + x_2 \quad \text{and} \quad y_2 = \frac{x_1}{x_1 + x_2}$$

then multiplying $y_2$ by $y_1$ gives an expression for $x_1$, and then we can substitute this
into the expression for $y_1$ and solve for $x_2$:

$$x_1 = y_1 y_2 = v_1(y_1, y_2) \qquad x_2 = y_1(1 - y_2) = v_2(y_1, y_2)$$

In a final burst of notation, let

$$S = \{(x_1, x_2) : f(x_1, x_2) > 0\} \qquad T = \{(y_1, y_2) : g(y_1, y_2) > 0\}$$

That is, S is the region of positive density for the original variables and T is the
region of positive density for the new variables; T is the “image” of S under the
transformation.


TRANSFORMATION THEOREM (bivariate case) 
Suppose that the partial derivative of each $v_i(y_1, y_2)$ with respect to both $y_1$
and $y_2$ exists and is continuous for every $(y_1, y_2) \in T$. Form the 2 × 2 matrix

$$M = \begin{bmatrix} \dfrac{\partial v_1(y_1, y_2)}{\partial y_1} & \dfrac{\partial v_1(y_1, y_2)}{\partial y_2} \\[2ex] \dfrac{\partial v_2(y_1, y_2)}{\partial y_1} & \dfrac{\partial v_2(y_1, y_2)}{\partial y_2} \end{bmatrix}$$

The determinant of this matrix, called the Jacobian, is

$$\det(M) = \frac{\partial v_1}{\partial y_1} \cdot \frac{\partial v_2}{\partial y_2} - \frac{\partial v_1}{\partial y_2} \cdot \frac{\partial v_2}{\partial y_1}$$

The joint pdf for the new variables then results from taking the joint pdf
$f(x_1, x_2)$ for the original variables, replacing $x_1$ and $x_2$ by their expressions in
terms of $y_1$ and $y_2$, and finally multiplying this by the absolute value of the
Jacobian:

$$g(y_1, y_2) = f(v_1(y_1, y_2), v_2(y_1, y_2)) \cdot |\det(M)| \qquad (y_1, y_2) \in T$$


The theorem can be rewritten slightly by using the notation

$$\det(M) = \frac{\partial(x_1, x_2)}{\partial(y_1, y_2)}$$

Then we have

$$g(y_1, y_2) = f(x_1, x_2) \cdot \left| \frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} \right|$$

which is the natural extension of the univariate transformation theorem
$g(y) = f(x) \cdot |dx/dy|$ discussed in Chap. 3.


Example 4.39 Continuing with the component lifetime situation, suppose that $X_1$
and $X_2$ are independent, each having an exponential distribution with parameter $\lambda$.
Let’s determine the joint pdf of

$$Y_1 = u_1(X_1, X_2) = X_1 + X_2 \quad \text{and} \quad Y_2 = u_2(X_1, X_2) = \frac{X_1}{X_1 + X_2}$$

We have already inverted this transformation:

$$x_1 = v_1(y_1, y_2) = y_1 y_2 \qquad x_2 = v_2(y_1, y_2) = y_1(1 - y_2)$$

The image of the transformation, i.e., the set of $(y_1, y_2)$ pairs with positive
density, is $y_1 > 0$ and $0 < y_2 < 1$. The four relevant partial derivatives are

$$\frac{\partial v_1}{\partial y_1} = y_2 \qquad \frac{\partial v_1}{\partial y_2} = y_1 \qquad \frac{\partial v_2}{\partial y_1} = 1 - y_2 \qquad \frac{\partial v_2}{\partial y_2} = -y_1$$

from which the Jacobian is $-y_1 y_2 - y_1(1 - y_2) = -y_1$.

Since the joint pdf of $X_1$ and $X_2$ is

$$f(x_1, x_2) = \lambda e^{-\lambda x_1} \cdot \lambda e^{-\lambda x_2} = \lambda^2 e^{-\lambda(x_1 + x_2)} \qquad x_1 > 0,\; x_2 > 0$$

we have

$$g(y_1, y_2) = \lambda^2 e^{-\lambda y_1} \cdot y_1 = \lambda^2 y_1 e^{-\lambda y_1} \cdot 1 \qquad y_1 > 0,\; 0 < y_2 < 1$$

The joint pdf thus factors into two parts. The first part is a gamma pdf with
parameters $\alpha = 2$ and $\beta = 1/\lambda$, and the second part is a uniform pdf on (0, 1). Since
the pdf factors and the region of positive density is rectangular, we have
demonstrated that

1. The distribution of system lifetime $X_1 + X_2$ is gamma (with $\alpha = 2$, $\beta = 1/\lambda$)

2. The distribution of the proportion of system lifetime during which the original
component functions is uniform on (0, 1)

3. $Y_1 = X_1 + X_2$ and $Y_2 = X_1/(X_1 + X_2)$ are independent of each other ■
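
The three conclusions of Example 4.39 lend themselves to a quick simulation check. The following is a sketch under assumed values: the rate $\lambda = 2$ and the replication count are arbitrary illustrative choices, and the helper `correlation` is written out by hand rather than taken from a library. A sample correlation near zero is consistent with independence but does not by itself prove it.

```python
import random
from statistics import mean

def correlation(a, b):
    """Sample correlation coefficient of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(11)
lam, reps = 2.0, 50000  # assumed rate parameter and number of simulated systems

y1, y2 = [], []
for _ in range(reps):
    x1, x2 = rng.expovariate(lam), rng.expovariate(lam)
    y1.append(x1 + x2)          # total system lifetime
    y2.append(x1 / (x1 + x2))   # proportion due to the original component

print(f"mean of Y1: {mean(y1):.3f}  (gamma mean alpha*beta = {2 / lam:.3f})")
print(f"mean of Y2: {mean(y2):.3f}  (uniform(0,1) mean = 0.500)")
print(f"P(Y2 <= .25) estimate: {sum(v <= 0.25 for v in y2) / reps:.3f}  (uniform: 0.250)")
print(f"correlation of Y1 and Y2: {correlation(y1, y2):.3f}")
```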


In the foregoing example, because the joint pdf factored into one pdf involving
$y_1$ alone and another pdf involving $y_2$ alone, the individual (i.e., marginal) pdfs of
the two new variables were obtained from the joint pdf without any further effort.
Often this will not be the case; that is, $Y_1$ and $Y_2$ will not be independent. Then to
obtain the marginal pdf of $Y_1$, the joint pdf must be integrated over all values of the
second variable. In fact, in many applications an investigator wishes to obtain the
distribution of a single function $u_1(X_1, X_2)$ of the original variables. To accomplish
this, a second function $u_2(X_1, X_2)$ is selected, the joint pdf is obtained, and then $y_2$
is integrated out. There are of course many ways to select the second function. The
choice should be made so that the transformation can be easily inverted and the
integration in the last step is straightforward.


Example 4.40 Consider a rectangular coordinate system with a horizontal $x_1$ axis
and a vertical $x_2$ axis as shown in Fig. 4.15a. First a point $(X_1, X_2)$ is randomly
selected, where the joint pdf of $X_1, X_2$ is

$$f(x_1, x_2) = \begin{cases} x_1 + x_2 & 0 < x_1 < 1,\; 0 < x_2 < 1 \\ 0 & \text{otherwise} \end{cases}$$

Then a rectangle with vertices (0, 0), $(X_1, 0)$, $(0, X_2)$, and $(X_1, X_2)$ is formed as
shown in Fig. 4.15a. What is the distribution of $X_1 X_2$, the area of this rectangle? To
answer this question, let

$$Y_1 = X_1 X_2 \qquad Y_2 = X_2$$

so

$$y_1 = u_1(x_1, x_2) = x_1 x_2 \qquad y_2 = u_2(x_1, x_2) = x_2$$

Then

$$x_1 = v_1(y_1, y_2) = \frac{y_1}{y_2} \qquad x_2 = v_2(y_1, y_2) = y_2$$

Fig. 4.15 Regions of positive density for Example 4.40

Notice that because $x_2$ ($= y_2$) is between 0 and 1 and $y_1$ is the product of the two
$x_i$s, it must be the case that $0 < y_1 < y_2$. The region of positive density for the new
variables is then

$$T = \{(y_1, y_2) : 0 < y_1 < y_2,\; 0 < y_2 < 1\}$$

which is the triangular region shown in Fig. 4.15b.
Since $\partial v_2/\partial y_1 = 0$, the product of the two off-diagonal elements in the matrix
$M$ will be 0, so only the two diagonal elements contribute to the Jacobian:

$$M = \begin{bmatrix} \dfrac{1}{y_2} & -\dfrac{y_1}{y_2^2} \\[2ex] 0 & 1 \end{bmatrix} \;\Rightarrow\; |\det(M)| = \frac{1}{y_2}$$

The joint pdf of the two new variables is now

$$g(y_1, y_2) = f\left(\frac{y_1}{y_2}, y_2\right) \cdot |\det(M)| = \begin{cases} \left(\dfrac{y_1}{y_2} + y_2\right) \cdot \dfrac{1}{y_2} & 0 < y_1 < y_2,\; 0 < y_2 < 1 \\ 0 & \text{otherwise} \end{cases}$$

To obtain the marginal pdf of $Y_1$ alone, we must now fix $y_1$ at some arbitrary
value between 0 and 1, and integrate out $y_2$. Figure 4.15b shows that we must
integrate along the vertical line segment passing through $y_1$ whose lower limit is $y_1$
and whose upper limit is 1:

$$g_1(y_1) = \int_{y_1}^{1} \left(\frac{y_1}{y_2} + y_2\right) \cdot \frac{1}{y_2}\, dy_2 = 2(1 - y_1) \qquad 0 < y_1 < 1$$

This marginal pdf can now be integrated to obtain any desired probability
involving the area. For example, integrating from 0 to .5 gives $P(\text{area} \le .5) = .75$.
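
The value $P(\text{area} \le .5) = .75$ can be checked by simulating points from the joint pdf $f(x_1, x_2) = x_1 + x_2$ on the unit square. The sketch below uses rejection sampling, which is a technique chosen here for illustration and not the method of the text; the function name `sample_point`, the seed, and the replication count are assumptions. Since $f \le 2$ on the unit square, a uniformly chosen point is accepted with probability $f/2$.

```python
import random
from statistics import mean

def sample_point(rng):
    """Draw (x1, x2) from f(x1, x2) = x1 + x2 on the unit square by rejection sampling."""
    while True:
        x1, x2 = rng.random(), rng.random()
        # f is bounded above by 2 on the unit square; accept with probability f(x1, x2) / 2
        if rng.random() * 2.0 <= x1 + x2:
            return x1, x2

rng = random.Random(5)
reps = 40000
areas = []
for _ in range(reps):
    x1, x2 = sample_point(rng)
    areas.append(x1 * x2)  # Y1 = area of the rectangle

est = sum(a <= 0.5 for a in areas) / reps
print(f"estimated P(area <= .5): {est:.3f}  (from g1: .75)")
print(f"estimated mean area:     {mean(areas):.3f}  (from g1: 1/3)")
```

The mean area 1/3 quoted in the output label follows from integrating $y_1 \cdot 2(1 - y_1)$ over (0, 1).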


4.6.2 The Joint Distribution of More Than Two New Variables


Consider now starting with three random variables $X_1$, $X_2$, and $X_3$, and forming
three new variables $Y_1$, $Y_2$, and $Y_3$. Suppose again that the transformation can be
inverted to express the original variables in terms of the new ones:

$$x_1 = v_1(y_1, y_2, y_3), \qquad x_2 = v_2(y_1, y_2, y_3), \qquad x_3 = v_3(y_1, y_2, y_3)$$

Then the foregoing theorem can be extended to this new situation. The Jacobian
matrix has dimension 3 × 3, with the entry in the ith row and jth column being
$\partial v_i/\partial y_j$. The joint pdf of the new variables results from replacing each $x_i$ in the
original pdf $f(\cdot)$ by its expression in terms of the $y_j$s and multiplying by the absolute
value of the Jacobian.


Example 4.41 Consider n = 3 identical components with independent lifetimes $X_1$,
$X_2$, $X_3$, each having an exponential distribution with parameter $\lambda$. If the first
component is used until it fails, replaced by the second one which remains in
service until it fails, and finally the third component is used until failure, then the
total lifetime of these components is $Y_3 = X_1 + X_2 + X_3$. (This design structure,
where one component is replaced by the next in succession, is called a standby
system.) To find the distribution of total lifetime, let's first define two other new
variables: $Y_1 = X_1$ and $Y_2 = X_1 + X_2$ (so that $Y_1 < Y_2 < Y_3$). After finding the joint pdf
of all three variables, we integrate out the first two variables to obtain the desired
information. Solving for the old variables in terms of the new gives


$$x_1 = y_1 \qquad x_2 = y_2 - y_1 \qquad x_3 = y_3 - y_2$$


It is obvious by inspection of these expressions that the three diagonal elements
of the Jacobian matrix are all 1s and that the elements above the diagonal are all 0s,
so the determinant is 1, the product of the diagonal elements. Since


$$f(x_1, x_2, x_3) = \lambda^3 e^{-\lambda(x_1 + x_2 + x_3)} \qquad x_1 > 0,\ x_2 > 0,\ x_3 > 0$$

by substitution,

$$g(y_1, y_2, y_3) = \lambda^3 e^{-\lambda y_3} \qquad 0 < y_1 < y_2 < y_3$$


Integrating this joint pdf first with respect to $y_1$ between 0 and $y_2$ and then with
respect to $y_2$ between 0 and $y_3$ (try it!) gives

$$g_3(y_3) = \frac{\lambda^3}{2}\, y_3^2\, e^{-\lambda y_3} \qquad y_3 > 0$$


which is the gamma pdf with $\alpha = 3$ and $\beta = 1/\lambda$. This result and Example 3.39 are
both special cases of a proposition from Sect. 4.3, stating that the sum of n iid
exponential rvs has a gamma distribution with $\alpha = n$.
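
The gamma result for the standby system can be checked by simulation. The sketch below is a minimal Python illustration, not part of the original text: the rate 0.01, the time point 300, and the function names are assumptions chosen only for the demonstration. It compares the closed-form cdf obtained by integrating $g_3$ (the gamma cdf with $\alpha = 3$) with the empirical proportion of simulated total lifetimes that fall at or below the same time.

```python
import math
import random

def gamma3_cdf(t, lam):
    """cdf of the total lifetime Y3: gamma with alpha = 3 and beta = 1/lam.

    Obtained by integrating (lam**3 / 2) * y**2 * exp(-lam * y) from 0 to t.
    """
    x = lam * t
    return 1.0 - math.exp(-x) * (1.0 + x + x * x / 2.0)

def simulate_standby_cdf(t, lam, n=100_000, seed=2):
    """Proportion of simulated standby systems whose total lifetime is <= t."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n):
        total = rng.expovariate(lam) + rng.expovariate(lam) + rng.expovariate(lam)
        count += (total <= t)
    return count / n

lam = 0.01  # assumed rate (mean component lifetime 100), for illustration only
theory = gamma3_cdf(300.0, lam)
estimate = simulate_standby_cdf(300.0, lam)
print(round(theory, 4), round(estimate, 3))
```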


4.6.3 Exercises: Section 4.6 (103-110) 


103. Let $X_1$ and $X_2$ be independent, standard normal rvs.

(a) Define $Y_1 = X_1 + X_2$ and $Y_2 = X_1 - X_2$. Determine the joint pdf of $Y_1$ and
$Y_2$.

(b) Determine the marginal pdf of $Y_1$. [Note: We know the sum of two
independent normal rvs is normal, so you can check your answer against
the appropriate normal pdf.]

(c) Are $Y_1$ and $Y_2$ independent?

104. Consider two components whose lifetimes $X_1$ and $X_2$ are independent and
exponentially distributed with parameters $\lambda_1$ and $\lambda_2$, respectively. Obtain the
joint pdf of total lifetime $X_1 + X_2$ and the proportion of total lifetime
$X_1/(X_1 + X_2)$ during which the first component operates.

105. Let $X_1$ denote the time (hr) it takes to perform a first task and $X_2$ denote the
time it takes to perform a second one. The second task always takes at least as
long to perform as the first task. The joint pdf of these variables is


$$f(x_1, x_2) = \begin{cases} 2(x_1 + x_2) & 0 \le x_1 \le x_2 \le 1 \\ 0 & \text{otherwise} \end{cases}$$


(a) Obtain the pdf of the total completion time for the two tasks.
(b) Obtain the pdf of the difference $X_2 - X_1$ between the longer completion
time and the shorter time.
106. An exam consists of a problem section and a short-answer section. Let $X_1$
denote the amount of time (h) that a student spends on the problem section and
$X_2$ represent the amount of time the same student spends on the short-answer
section. Suppose the joint pdf of these two times is


$$f(x_1, x_2) = \begin{cases} c\,x_1 x_2 & \dfrac{x_1}{3} < x_2 < \dfrac{x_1}{2},\ 0 < x_1 < 1 \\ 0 & \text{otherwise} \end{cases}$$


(a) What is the value of c?

(b) If the student spends exactly .25 h on the short-answer section, what is the
probability that at most .60 h was spent on the problem section? [Hint:
First obtain the relevant conditional distribution.]

(c) What is the probability that the amount of time spent on the problem part
of the exam exceeds the amount of time spent on the short-answer part by
at least .5 h?

(d) Obtain the joint distribution of $Y_1 = X_2/X_1$, the ratio of the two times, and
$Y_2 = X_2$. Then obtain the marginal distribution of the ratio.

107. Consider randomly selecting a point ($X_1$, $X_2$, $X_3$) in the unit cube
$\{(x_1, x_2, x_3) : 0 < x_1 < 1,\ 0 < x_2 < 1,\ 0 < x_3 < 1\}$ according to the joint pdf


$$f(x_1, x_2, x_3) = \begin{cases} 8 x_1 x_2 x_3 & 0 < x_1 < 1,\ 0 < x_2 < 1,\ 0 < x_3 < 1 \\ 0 & \text{otherwise} \end{cases}$$


(so the three variables are independent). Then form a rectangular solid whose
vertices are (0, 0, 0), ($X_1$, 0, 0), (0, $X_2$, 0), ($X_1$, $X_2$, 0), (0, 0, $X_3$), ($X_1$, 0, $X_3$),
(0, $X_2$, $X_3$), and ($X_1$, $X_2$, $X_3$). The volume of this solid is $Y_3 = X_1 X_2 X_3$. Obtain
the pdf of this volume. [Hint: Let $Y_1 = X_1$ and $Y_2 = X_1 X_2$.]

108. Let $X_1$ and $X_2$ be independent, each having a standard normal distribution. The
pair ($X_1$, $X_2$) corresponds to a point in a two-dimensional coordinate system.
Consider now changing to polar coordinates via the transformation


$$Y_1 = X_1^2 + X_2^2$$

$$Y_2 = \begin{cases} \arctan\left(\dfrac{X_2}{X_1}\right) & X_1 > 0,\ X_2 \ge 0 \\ \arctan\left(\dfrac{X_2}{X_1}\right) + 2\pi & X_1 > 0,\ X_2 < 0 \\ \arctan\left(\dfrac{X_2}{X_1}\right) + \pi & X_1 < 0 \\ 0 & X_1 = 0 \end{cases}$$


from which $X_1 = \sqrt{Y_1}\cos(Y_2)$, $X_2 = \sqrt{Y_1}\sin(Y_2)$. Obtain the joint pdf of the
new variables and then the marginal distribution of each one. [Note: It would
be nice if we could simply let $Y_2 = \arctan(X_2/X_1)$, but in order to ensure
invertibility of the arctan function, it is defined to take on values only
between $-\pi/2$ and $\pi/2$. Our specification of $Y_2$ allows it to assume any value
between 0 and $2\pi$.]

109. The result of the previous exercise suggests how observed values of two
independent standard normal variables can be generated by first generating
their polar coordinates with an exponential rv with $\lambda = 1/2$ and an independent
Unif(0, $2\pi$) rv: Let $U_1$ and $U_2$ be independent Unif(0, 1) rvs, and then let

$$Y_1 = -2\ln(U_1) \qquad Y_2 = 2\pi U_2$$

$$Z_1 = \sqrt{Y_1}\cos(Y_2) \qquad Z_2 = \sqrt{Y_1}\sin(Y_2)$$


Show that the $Z_i$s are independent standard normal. [Note: This is called the
Box-Muller transformation after the two individuals who discovered it. Now
that statistical software packages will generate almost instantaneously
observations from a normal distribution with any mean and variance, it is
thankfully no longer necessary for people like you and us to carry out the
transformations just described. Let the software do it!]

110. Let $X_1$ and $X_2$ be independent random variables, each having a standard
normal distribution. Show that the pdf of the ratio $Y = X_1/X_2$ is given by
$f(y) = 1/[\pi(1 + y^2)]$ for $-\infty < y < \infty$. (This is called the standard Cauchy
distribution; its density curve is bell-shaped, but the tails are so heavy that $\mu$
does not exist.)
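
The Box-Muller recipe of Exercise 109 is short enough to try directly. The following Python sketch is only an illustration under our own naming (box_muller, the seed, and the sample size are all assumptions, not anything from the text); it generates pairs ($Z_1$, $Z_2$) from uniform draws and reports summary statistics that should be close to those of independent standard normal variables. It does not replace the proof the exercise asks for.

```python
import math
import random
import statistics

def box_muller(rng):
    """Return one pair (z1, z2) built from two independent Unif(0, 1) draws."""
    u1 = 1.0 - rng.random()    # lies in (0, 1], so log(u1) is finite
    u2 = rng.random()
    y1 = -2.0 * math.log(u1)   # exponential with lambda = 1/2 (squared radius)
    y2 = 2.0 * math.pi * u2    # uniform angle on [0, 2*pi)
    r = math.sqrt(y1)
    return r * math.cos(y2), r * math.sin(y2)

rng = random.Random(3)
pairs = [box_muller(rng) for _ in range(50_000)]
z1 = [p[0] for p in pairs]
z2 = [p[1] for p in pairs]

mean1, sd1 = statistics.fmean(z1), statistics.pstdev(z1)
mean2, sd2 = statistics.fmean(z2), statistics.pstdev(z2)
# Average of the products z1*z2; near 0 when the two coordinates are uncorrelated,
# because both sample means are themselves near 0.
cross_moment = statistics.fmean(a * b for a, b in zip(z1, z2))
print(round(mean1, 3), round(sd1, 3), round(mean2, 3), round(sd2, 3), round(cross_moment, 3))
```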


4.7 The Bivariate Normal Distribution 


Perhaps the most useful joint distribution is the bivariate normal. Although the 
formula may seem rather complicated, it is based on a simple quadratic expression 
in the standardized variables (subtract the mean and then divide by the standard 
deviation). The bivariate normal density is 


$$f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right) + \left(\frac{y-\mu_2}{\sigma_2}\right)^2 \right] \right\}$$


The notation used here for the five parameters reflects the roles they play. Some
tedious integration shows that $\mu_1$ and $\sigma_1$ are the mean and standard deviation,
respectively, of X, $\mu_2$ and $\sigma_2$ are the mean and standard deviation, respectively, of
Y, and $\rho$ is the correlation coefficient between the two variables. The integration
required to do bivariate normal probability calculations is quite difficult. Computer
code is available for calculating P(X ≤ x, Y ≤ y) approximately using numerical
integration, and some software packages, including Matlab and R, incorporate
this feature (see the end of this section).

The density surface in three dimensions looks like a mountain with elliptical 
cross-sections, as shown in Fig. 4.16a. The vertical cross-sections are all propor- 
tional to normal densities. If we set f(x, y)=c to investigate the contours (curves 
along which the density is constant), this amounts to equating the exponent of the 
joint pdf to a constant. The contours are then concentric ellipses centered at (x, y) =
($\mu_1$, $\mu_2$), as shown in Fig. 4.16b.

If $\rho = 0$, then $f(x, y) = f_X(x) \cdot f_Y(y)$, where X is normal with mean $\mu_1$ and standard
deviation $\sigma_1$, and Y is normal with mean $\mu_2$ and standard deviation $\sigma_2$. That is, X and
Y have independent normal distributions. In this case the axes of the elliptical contours
are parallel to the coordinate axes, and the contours reduce to circles when $\sigma_1 = \sigma_2$.
Recall that in Sect. 4.2 we emphasized that independence of X and
Y implies $\rho = 0$ but, in general, $\rho = 0$ does not imply independence. However, we
have just seen that when X and Y are bivariate normal $\rho = 0$ does imply independence.
Therefore, in the bivariate normal case $\rho = 0$ if and only if the two rvs are
independent.


Fig. 4.16 (a) A graph of the bivariate normal pdf; (b) contours of the bivariate normal pdf

Regardless of whether or not $\rho = 0$, the marginal distribution $f_X(x)$ is just a
normal pdf with mean $\mu_1$ and standard deviation $\sigma_1$:


$$f_X(x) = \frac{1}{\sigma_1\sqrt{2\pi}}\, e^{-(x-\mu_1)^2/(2\sigma_1^2)}$$


The integration to show this [integrating f(x, y) on y from $-\infty$ to $\infty$] is rather
messy. Likewise, the marginal distribution of Y is N($\mu_2$, $\sigma_2$). These two marginal
pdfs are, in fact, just special cases of a much stronger result.


PROPOSITION 

If X and Y have a bivariate normal distribution, then any linear combination of
X and Y is also normal. That is, for any constants a, b, c, the random variable
aX + bY + c has a normal distribution.


This proposition can be proved using the transformation techniques of Sect. 4.6 
along with some extremely tedious algebra. Setting a= 1 and b=c=0, we have 
that X is normally distributed; a=0, b= 1, c=0 yields the same result for Y. To 
find the mean and standard deviation of a general linear combination, one can use 
the rules for linear combinations established in Sect. 4.3. 


Example 4.42 Many students applying for college take the SAT, which consists of
three components: Critical Reading, Mathematics, and Writing. While some
colleges use all three components to determine admission, many only look at the
first two (reading and math). Let X and Y denote the Critical Reading and Mathematics
scores, respectively, for a randomly selected student. According to the
College Board Web site, the population of students taking the exam in Fall 2012
had the following results:


$$\mu_1 = 496, \quad \sigma_1 = 114, \quad \mu_2 = 514, \quad \sigma_2 = 117$$


Suppose that X and Y have approximately (because both X and Y are discrete) a 
bivariate normal distribution with correlation coefficient p = .25. Let’s determine 
the probability that a student’s total score across these two components exceeds 
1250, the minimum admission score for a particular university. 

Our goal is to calculate P(X + Y > 1250). Using the bivariate normal pdf, the
desired probability is a daunting double integral:


$$\frac{1}{2\pi(114)(117)\sqrt{1-.25^2}} \int_{-\infty}^{\infty}\int_{1250-y}^{\infty} \exp\left\{ -\frac{[(x-496)/114]^2 - 2(.25)(x-496)(y-514)/[(114)(117)] + [(y-514)/117]^2}{2(1-.25^2)} \right\} dx\,dy$$


This is not a practical way to solve this problem! Instead, recognize X + Y as a
linear combination of X and Y; by the preceding proposition, X + Y has a normal
distribution. The mean and variance of X + Y are calculated using the formulas from
Sect. 4.5:

$$E(X + Y) = E(X) + E(Y) = \mu_1 + \mu_2 = 496 + 514 = 1010$$

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\mathrm{Cov}(X, Y) = \sigma_1^2 + \sigma_2^2 + 2\rho\sigma_1\sigma_2 = 114^2 + 117^2 + 2(.25)(114)(117) = 33{,}354$$


Therefore,

$$P(X + Y > 1250) = 1 - \Phi\left(\frac{1250 - 1010}{\sqrt{33{,}354}}\right) = 1 - \Phi(1.31) = .0951$$


Suppose instead we wish to determine P(X < Y), the probability a student scores
better on math than on reading. If we rewrite this probability as P(X − Y < 0), then
we may apply the preceding proposition to the linear combination X − Y. With
E(X − Y) = −18 and Var(X − Y) = 20,016,


$$P(X < Y) = P(X - Y < 0) = \Phi\left(\frac{0 - (-18)}{\sqrt{20{,}016}}\right) = \Phi(0.13) = .5517$$
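
The two calculations in this example follow one pattern: find the mean and variance of aX + bY + c, then evaluate a normal cdf. The Python sketch below packages that pattern; it is our own illustration (the helper name linear_combo_normal is hypothetical) and uses only the standard library's statistics.NormalDist. Because it does not round the z-score to two decimals, its answers may differ from the tabled values .0951 and .5517 in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

def linear_combo_normal(a, b, c, mu1, sigma1, mu2, sigma2, rho):
    """Normal distribution of aX + bY + c when (X, Y) is bivariate normal."""
    mean = a * mu1 + b * mu2 + c
    # Var(aX + bY + c) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)
    var = (a * sigma1) ** 2 + (b * sigma2) ** 2 + 2 * a * b * rho * sigma1 * sigma2
    return NormalDist(mean, sqrt(var))

# Parameters of Example 4.42
total = linear_combo_normal(1, 1, 0, 496, 114, 514, 117, 0.25)    # X + Y
diff = linear_combo_normal(1, -1, 0, 496, 114, 514, 117, 0.25)    # X - Y

p_total_exceeds_1250 = 1 - total.cdf(1250)
p_reading_below_math = diff.cdf(0)
print(round(p_total_exceeds_1250, 4), round(p_reading_below_math, 4))
```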


4.7.1 Conditional Distributions of X and Y 


As in Sect. 4.4, the conditional density of Y given X = x results from dividing the
marginal density of X into f(x, y). The algebra is again a mess, but the result is fairly
simple.

PROPOSITION
Let X and Y have a bivariate normal distribution. Then the conditional
distribution of Y, given X = x, is normal with mean and variance


$$\mu_{Y|X=x} = E(Y \mid X = x) = \mu_2 + \rho\sigma_2\,\frac{x - \mu_1}{\sigma_1}$$


$$\sigma^2_{Y|X=x} = \mathrm{Var}(Y \mid X = x) = \sigma_2^2(1 - \rho^2)$$


Notice that the conditional mean of Y is a linear function of x, and the conditional
variance of Y doesn't depend on x at all. When $\rho = 0$, the conditional mean is the
mean of Y, $\mu_2$, and the conditional variance is just the variance of Y, $\sigma_2^2$. In other
words, if $\rho = 0$, then the conditional distribution of Y is the same as the unconditional
distribution of Y. When $\rho$ is close to 1 or −1 the conditional variance will be
much smaller than Var(Y), which says that knowledge of X will be very helpful in
predicting Y. If $\rho$ is near 0 then X and Y are nearly independent and knowledge of
X is not very useful in predicting Y.


Example 4.43 Let X and Y be the heights of a randomly selected mother and her
daughter, respectively. A similar situation was one of the first applications of the
bivariate normal distribution, by Francis Galton in 1886, and the data was found to
fit the distribution very well. Suppose X and Y have a bivariate normal distribution with mean
$\mu_1 = 64$ in. and standard deviation $\sigma_1 = 3$ in. for X and mean $\mu_2 = 65$ in. and
standard deviation $\sigma_2 = 3$ in. for Y. Here $\mu_2 > \mu_1$, which is in accord with the
increase in height from one generation to the next. Assume $\rho = .4$. Then


$$\mu_{Y|X=x} = \mu_2 + \rho\sigma_2\,\frac{x - \mu_1}{\sigma_1} = 65 + .4(3)\,\frac{x - 64}{3} = 65 + .4(x - 64) = .4x + 39.4$$


$$\sigma^2_{Y|X=x} = \mathrm{Var}(Y \mid X = x) = \sigma_2^2(1 - \rho^2) = 9(1 - .4^2) = 7.56 \quad \text{and} \quad \sigma_{Y|X=x} = 2.75$$

Notice that the conditional variance is 16% less than the variance of Y. Squaring
the correlation gives the percentage by which the conditional variance is reduced
relative to the variance of Y.
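
The conditional-distribution proposition translates directly into a few lines of code. The sketch below is a minimal Python illustration with a helper name of our own choosing (conditional_y_given_x); it evaluates the conditional distribution of the daughter's height for a mother who is 69 in. tall, using the parameter values assumed in Example 4.43.

```python
from math import sqrt
from statistics import NormalDist

def conditional_y_given_x(x, mu1, sigma1, mu2, sigma2, rho):
    """Normal distribution of Y given X = x under a bivariate normal model."""
    cond_mean = mu2 + rho * sigma2 * (x - mu1) / sigma1
    cond_var = sigma2 ** 2 * (1 - rho ** 2)   # does not depend on x
    return NormalDist(cond_mean, sqrt(cond_var))

# Example 4.43 parameters; x = 69 is a mother 5 in. above the mothers' mean
daughter = conditional_y_given_x(69, 64, 3, 65, 3, 0.4)
print(round(daughter.mean, 2), round(daughter.variance, 2), round(daughter.stdev, 2))

# Proportional reduction in variance relative to Var(Y) = 9; equals rho squared
reduction = 1 - daughter.variance / 9
```

Returning a NormalDist object rather than two numbers means conditional probabilities such as P(Y > 70 | X = 69) can be read off with `1 - daughter.cdf(70)`.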


4.7.2 Regression to the Mean 


The formula for the conditional mean can be reexpressed as

$$\frac{\mu_{Y|X=x} - \mu_2}{\sigma_2} = \rho \cdot \frac{x - \mu_1}{\sigma_1}$$


In words, when the formula is expressed in terms of standardized quantities, the 
standardized conditional mean is just p times the standardized x. In particular, for 
the height scenario 

$$\frac{\mu_{Y|X=x} - 65}{3} = .4 \cdot \frac{x - 64}{3}$$


If the mother is 5 in. above the mean of 64 in. for mothers, then the daughter’s 
conditional expected height is just 2 in. above the mean for daughters. In this 
example, with equal standard deviations for Y and X, the daughter’s conditional 
expected height is always closer to its mean than the mother’s height is to its mean. 
One can think of the conditional expectation as falling back toward the mean, and 
that is why Galton called this regression to the mean. 

Regression to the mean occurs in many contexts. For example, let X be a baseball 
player’s average for the first half of the season and let Y be the average for the 
second half. Most of the players with a high X (above .300) will not have such a 
high Y. The same kind of reasoning applies to the “sophomore jinx,” which says that 
if a player has a very good first season, then the player is unlikely to do as well in the 
second season. 


4.7.3. The Multivariate Normal Distribution 


The multivariate normal distribution extends the bivariate normal distribution to
situations involving models for n random variables $X_1$, $X_2$, ..., $X_n$ with n > 2. The
joint density function is quite complicated; the only way to express it compactly is
to make use of matrix algebra notation. And probability calculations based on this
distribution are extremely complex.
Here are some of the most important properties of the distribution:
• The distribution of any linear combination of $X_1$, $X_2$, ..., $X_n$ is normal
• The marginal distribution of any $X_i$ is normal
• The joint distribution of any pair $X_i$, $X_j$ is bivariate normal
• The conditional distribution of any $X_i$ given values of the other n − 1 variables is
normal
Many procedures for the analysis of multivariate data (observations simulta- 
neously on three or more variables) are based on assuming that the data was 
selected from a multivariate normal distribution. We recommend Methods of 
Multivariate Analysis, 3rd ed., by Rencher for more information on multivariate 
analysis and the multivariate normal distribution.


4.7.4 Bivariate Normal Calculations with Software


Matlab will compute probabilities under the bivariate normal pdf using the 
mvncdf command (“mvn” abbreviates multivariate normal). This function is 
illustrated in the next example. 


Example 4.44 Consider the SAT reading/math scenario of Example 4.42. What is
the probability that a randomly selected student scored at most 650 on both
components, i.e., what is P(X ≤ 650 ∩ Y ≤ 650)?

The desired probability cannot be expressed in terms of a linear combination of
X and Y, and so the technique of the earlier example does not apply. Figure 4.17
shows the required Matlab code. The first two inputs are the desired cdf values
[x, y] = [650, 650] and the means [$\mu_1$, $\mu_2$] = [496, 514], respectively. The third input
is called the covariance matrix of X and Y, defined by


$$C = \begin{bmatrix} \mathrm{Var}(X) & \mathrm{Cov}(X, Y) \\ \mathrm{Cov}(X, Y) & \mathrm{Var}(Y) \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$$


    mu=[496, 514];
    C=[114^2, .25*114*117; .25*114*117, 117^2];
    mvncdf([650, 650],mu,C)

Fig. 4.17 Matlab code for Example 4.44


Matlab returns an answer of .8097, so for X and Y having a bivariate normal
distribution with the parameters specified in Example 4.42, P(X ≤ 650 ∩
Y ≤ 650) = .8097. About 81% of students scored 650 or below on both the Critical
Reading and Mathematics components, according to this model.


The pmvnorm function in R will perform the same calculation with the same 
inputs (the covariance matrix is labeled sigma). Users must install the mvtnorm 
package to access this function. 
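
For readers without Matlab or the mvtnorm package, the same probability can be approximated with a one-dimensional numerical integral, because P(X ≤ x0, Y ≤ y0) is the integral over x ≤ x0 of $f_X(x) \cdot P(Y \le y_0 \mid X = x)$ and the conditional distribution of Y is normal by the proposition of Sect. 4.7.1. The Python sketch below is our own illustration of that idea, not the algorithm used by mvncdf or pmvnorm; the function name, the midpoint rule, and the truncation point ten standard deviations below the mean are all assumptions, and it requires |ρ| < 1.

```python
from math import sqrt
from statistics import NormalDist

def bivariate_normal_cdf(x0, y0, mu1, sigma1, mu2, sigma2, rho, steps=20_000):
    """Approximate P(X <= x0, Y <= y0) for a bivariate normal pair, |rho| < 1."""
    x_dist = NormalDist(mu1, sigma1)
    cond_sd = sigma2 * sqrt(1 - rho ** 2)      # sd of Y given X = x
    lower = mu1 - 10 * sigma1                  # probability below this is negligible
    if x0 <= lower:
        return 0.0
    h = (x0 - lower) / steps
    total = 0.0
    for k in range(steps):
        x = lower + (k + 0.5) * h              # midpoint of the kth subinterval
        cond_mean = mu2 + rho * sigma2 * (x - mu1) / sigma1
        total += x_dist.pdf(x) * NormalDist(cond_mean, cond_sd).cdf(y0)
    return total * h

# Example 4.44: both SAT components at most 650
p_both = bivariate_normal_cdf(650, 650, 496, 114, 514, 117, 0.25)
print(round(p_both, 4))
```

The result should agree closely with the value .8097 reported by Matlab.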


4.7.5 Exercises: Section 4.7 (111-120) 


111. Example 4.42 introduced a bivariate normal model for X = SAT Critical
Reading score and Y = SAT Mathematics score. Let W = SAT Writing
score (the third component of the SAT), which has mean 488 and standard
deviation 114. Suppose X and W have a bivariate normal distribution with
$\rho_{X,W}$ = Corr(X, W) = .5.

(a) An English department plans to use X + W, a student’s total score on the 
non-math sections of the SAT, to help determine admission. Determine 
the distribution of X + W. 

(b) Calculate P(X + W > 1200).

(c) Suppose the English department wishes to admit only those students who
score in the top 10% on this Critical Reading + Writing criterion. What
combined score separates the top 10% of students from the rest?

112. In the context of the previous exercise, let T = X + Y + W, a student's grand
total score on the three components of the SAT.

(a) Find the expected value of T.

(b) Assume Corr(Y, W) = .2. Find the variance of T. [Hint: Use Expression
(4.5) from Sect. 4.3.]

(c) Suppose X, Y, W have a multivariate normal distribution, in which case
T is also normally distributed. Determine P(T > 2000).

(d) What is the 99th percentile of SAT grand total scores, according to this
model?

113. Let X = height (inches) and Y = weight (lbs) for an American male. Suppose
X and Y have a bivariate normal distribution, the mean and sd of heights are
70 in. and 3 in., the mean and sd of weights are 170 lbs and 20 lbs, and the
correlation coefficient is $\rho = .9$.

(a) Determine the distribution of Y given X = 68, i.e., the weight distribution
for 5′8″ American males.

(b) Determine the distribution of Y given X = 70, i.e., the weight distribution
for 5′10″ American males. In what ways is this distribution similar to
that of part (a), and how are they different?

(c) Calculate P(Y < 180 | X = 72), the probability that a 6-ft-tall American
male weighs less than 180 lb.

114. In electrical engineering, the unwanted "noise" in voltage or current signals
is often modeled by a Gaussian (i.e., normal) distribution. Suppose that the
noise in a particular voltage signal has a constant mean of 0.9 V, and that two
noise instances sampled $\tau$ seconds apart have a bivariate normal distribution
with covariance equal to $0.04e^{-|\tau|/10}$. Let X and Y denote the noise at times 3 s
and 8 s, respectively.

(a) Determine Cov(X, Y).

(b) Determine $\sigma_X$ and $\sigma_Y$. [Hint: Var(X) = Cov(X, X).]

(c) Determine Corr(X, Y).

(d) Find the probability we observe greater voltage noise at time 3 s than at
time 8 s.

(e) Find the probability that the voltage noise at time 3 s is more than 1 V
above the voltage noise at time 8 s.

115. For a Calculus I class, the final exam score Y and the average X of the four
earlier tests have a bivariate normal distribution with mean $\mu_1 = 73$, standard
deviation $\sigma_1 = 12$, mean $\mu_2 = 70$, standard deviation $\sigma_2 = 15$. The correlation
is $\rho = .71$. Determine

(a) $\mu_{Y|X=x}$

(b) $\sigma^2_{Y|X=x}$

(c) $\sigma_{Y|X=x}$

(d) P(Y > 90 | X = 80), i.e., the probability that the final exam score exceeds
90 given that the average of the four earlier tests is 80

116. Refer to the previous exercise. Suppose a student's Calculus I grade is
determined by 4X + Y, the total score across five tests.

(a) Find the mean and standard deviation of 4X + Y.

(b) Determine P(4X + Y < 320).

(c) Suppose the instructor sets the curve in such a way that the top 15% of
students, based on total score across the five tests, will receive As. What
point total is required to get an A in Calculus I?

117. Let X and Y, reaction times (sec) to two different stimuli, have a bivariate
normal distribution with mean $\mu_1 = 20$ and standard deviation $\sigma_1 = 2$ for
X and mean $\mu_2 = 30$ and standard deviation $\sigma_2 = 5$ for Y. Assume $\rho = .8$.
Determine

(a) $\mu_{Y|X=x}$

(b) $\sigma^2_{Y|X=x}$

(c) $\sigma_{Y|X=x}$

(d) P(Y > 46 | X = 25)

118. Refer to the previous exercise.

(a) One researcher is interested in X + Y, the total reaction time to the two
stimuli. Determine the mean and standard deviation of X + Y.

(b) If X and Y were independent, what would be the standard deviation of
X + Y? Explain why it makes sense that the sd in part (a) is much larger
than this.

(c) Another researcher is interested in Y − X, the difference in the reaction
times to the two stimuli. Determine the mean and standard deviation of
Y − X.

(d) If X and Y were independent, what would be the standard deviation of
Y − X? Explain why it makes sense that the sd in part (c) is much smaller
than this.

119. Let X and Y be the times for a randomly selected individual to complete two
different tasks, and assume that (X, Y) has a bivariate normal distribution
with $\mu_1 = 100$, $\sigma_1 = 50$, $\mu_2 = 25$, $\sigma_2 = 5$, $\rho = .4$. From statistical software we
obtain P(X ≤ 100, Y ≤ 25) = .3333, P(X ≤ 50, Y ≤ 20) = .0625, P(X ≤ 50,
Y ≤ 25) = .1274, and P(X ≤ 100, Y ≤ 20) = .1274.

(a) Determine P(50 < X < 100, 20 < Y < 25).

(b) Leave the other parameters the same but change the correlation to
$\rho = 0$ (independence). Now recompute the probability in part (a). Intuitively,
why should the original be larger?

120. One of the propositions of this section gives an expression for E(Y | X = x).

(a) By reversing the roles of X and Y give a similar formula for E(X | Y = y).

(b) Both E(Y | X = x) and E(X | Y = y) are linear functions. Show that the product
of the two slopes is $\rho^2$.




4.8 Reliability 


Reliability theory is the branch of statistics and operations research devoted to 
studying how long systems will function properly. A “system” can refer to a single 
device, such as a DVR, or a network of devices or objects connected together (e.g., 
electronic components or stages in an assembly line). For any given system, the 
primary variable of interest is T= the system’s lifetime, i.e., the duration of time 
until the system fails (either permanently or until repairs/upgrades are made). Since 
T measures time, we always have T ≥ 0. Most often, T is modeled as a continuous rv
on (0, ∞), though occasionally lifetimes are modeled as discrete or, at least, having
positive probability of equaling zero (such as a light bulb that never turns on). The 
probability distribution of T is often described in terms of its reliability function. 


4.8.1 The Reliability Function 


DEFINITION 

Let T denote the lifetime (i.e., the time to failure) of some system. The
reliability function of T (or of the system), denoted by R(t), is defined for
t ≥ 0 by


$$R(t) = P(T > t) = 1 - F(t),$$


where F(t) is the cdf of T. That is, R(t) is the probability that the system lasts
more than t time units. The reliability function is sometimes also called the
survival function of T.


Properties of F(t) and the relation R(t) = 1 − F(t) imply that
1. If T is a continuous rv on [0, ∞), then R(0) = 1.
2. R(t) is a non-increasing function of t.
3. R(t) → 0 as t → ∞.


Example 4.45 The exponential distribution serves as one of the most common
lifetime models in engineering practice. Suppose the lifetime T, in hours, of a
certain drill bit is exponential with parameter $\lambda = .01$ (equivalently, mean 100).
From Sect. 3.4, we know that T has cdf $F(t) = 1 - e^{-.01t}$, so the reliability function of
T is

$$R(t) = 1 - F(t) = e^{-.01t} \qquad t \ge 0$$


This function satisfies properties 1-3 above. A graph of R(t) appears in Fig. 4.18a. 

Now suppose instead that 5% of these drill bits shatter upon initial use, so that
P(T = 0) = .05, while the remaining 95% of such drill bits follow the aforementioned
exponential distribution. Since T cannot be negative, R(0) = P(T > 0) =
1 − P(T = 0) = .95. For t > 0, the reliability function of T is determined as follows:

$$\begin{aligned} R(t) &= P(T > t) \\ &= P(\text{bit doesn't shatter}) \cdot P(T > t \mid \text{bit doesn't shatter}) \\ &= (.95)(e^{-.01t}) = .95e^{-.01t} \end{aligned}$$


The expression $e^{-.01t}$ comes from the previous reliability function calculation. Since
this expression for R(t) equals .95 at t = 0, we have for all t ≥ 0 that $R(t) = .95e^{-.01t}$
(see Fig. 4.18b). This, too, is a non-increasing function of t with R(t) → 0 as t → ∞,
but property 1 does not hold because T is not a continuous rv (it has a "mass" of
.05 at t = 0).


Fig. 4.18 Reliability functions: (a) a continuous lifetime distribution; (b) lifetime with positive
probability of failure at t = 0


Example 4.46 The Weibull family of distributions offers a broader class of models
than does the exponential family. Recall from Sect. 3.5 that the cdf of a Weibull rv
is given by $F(x) = 1 - \exp(-(x/\beta)^\alpha)$, where $\alpha$ is the shape parameter and $\beta$ is the
scale parameter (both > 0). If a system's time to failure follows a Weibull distribution,
then the reliability function is


$$R(t) = 1 - F(t) = \exp(-(t/\beta)^\alpha)$$


Several examples of Weibull reliability functions are illustrated in Fig. 4.19. The
$\alpha = 1$ case corresponds to an exponential distribution with $\lambda = 1/\beta$. Interestingly,
models with larger values of $\alpha$ have higher reliability for small values of t (to be
precise, $t < \beta$) but lower reliability for larger t than do Weibull models with small $\alpha$
parameter values.
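
The crossover at $t = \beta$ described here is easy to see numerically. The short Python sketch below tabulates the Weibull reliability function for three shape parameters sharing one scale parameter; the values $\beta = 100$ and $\alpha$ = 0.5, 1, 2 are our own choices for illustration and are not taken from Fig. 4.19.

```python
import math

def weibull_reliability(t, alpha, beta):
    """R(t) = exp(-(t/beta)**alpha) for t >= 0."""
    return math.exp(-((t / beta) ** alpha))

beta = 100.0                  # assumed common scale parameter
alphas = (0.5, 1.0, 2.0)      # assumed shape parameters
table = {}
for t in (50.0, 100.0, 200.0):
    table[t] = [weibull_reliability(t, a, beta) for a in alphas]
    print(t, [round(r, 4) for r in table[t]])
```

Before $t = \beta$ the reliabilities increase with $\alpha$, at $t = \beta$ they all equal $e^{-1}$, and beyond $t = \beta$ the ordering reverses.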




Fig. 4.19 Reliability functions for Weibull lifetime distributions 


4.8.2 Series and Parallel Designs 


Now consider assessing the reliability of systems configured in series and/or 
parallel designs. Figure 4.20 illustrates the two basic designs: a series system 
works if and only if all of its components work, while a parallel system continues 
to function as long as at least one of its components is still functioning. Let $T_1, \ldots,
T_n$ denote the n component lifetimes and let $R_i(t) = P(T_i > t)$ be the reliability
function of the ith component. A standard assumption in reliability theory is that
the n components operate independently, i.e., that the $T_i$s are independent rvs.

Let T denote the lifetime of the series system depicted in Fig. 4.20a. Under the 
assumption of component independence, the system reliability function is 


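$$\begin{aligned} R(t) = P(T > t) &= P(\text{the system's lifetime exceeds } t) \\ &= P(\text{all } n \text{ component lifetimes exceed } t) && \text{series system} \\ &= P(T_1 > t \cap \cdots \cap T_n > t) \\ &= P(T_1 > t) \cdot \ldots \cdot P(T_n > t) && \text{by independence} \\ &= R_1(t) \cdot \ldots \cdot R_n(t) \end{aligned}$$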


That is, for a series design, the system reliability function equals the product of
the component reliability functions. On the other hand, the reliability function for
the parallel system in Fig. 4.20b is given by


$$\begin{aligned} R(t) &= P(\text{the system's lifetime exceeds } t) \\ &= P(\text{at least one component lifetime exceeds } t) && \text{parallel system} \\ &= 1 - P(\text{all component lifetimes are} \le t) \\ &= 1 - P(T_1 \le t \cap \cdots \cap T_n \le t) \\ &= 1 - P(T_1 \le t) \cdot \ldots \cdot P(T_n \le t) && \text{by independence} \\ &= 1 - [1 - R_1(t)] \cdot \ldots \cdot [1 - R_n(t)] \end{aligned}$$


Fig. 4.20 Basic system designs: (a) series connection; (b) parallel connection


These two results are summarized in the following proposition. 


PROPOSITION 

Suppose a system consists of n independent components with reliability
functions $R_1(t), \ldots, R_n(t)$.

1. If the n components are connected in series, the system reliability function is


$$R(t) = \prod_{i=1}^{n} R_i(t)$$


2. If the n components are connected in parallel, the system reliability function is


$$R(t) = 1 - \prod_{i=1}^{n} [1 - R_i(t)]$$


Example 4.47 Consider three independently operating devices, each of whose
lifetime (in hours) is exponentially distributed with mean 100. From Example 4.45,
$R_1(t) = R_2(t) = R_3(t) = e^{-.01t}$. If these three devices are connected in series,
the reliability function of the resulting system is


$$R(t) = \prod_{i=1}^{3} R_i(t) = (e^{-.01t})(e^{-.01t})(e^{-.01t}) = e^{-.03t}$$


In contrast, a parallel system using these three devices as its components has
reliability function

$$R(t) = 1 - \prod_{i=1}^{3} [1 - R_i(t)] = 1 - (1 - e^{-.01t})^3$$


These two reliability functions are graphed on the same set of axes in Fig. 4.21.
Both functions obey properties 1-3 from Sect. 4.8.1, but for any t > 0 the parallel system
reliability exceeds that of the series system, as it logically should. For example, the
probability the series system's lifetime exceeds 100 h (the expected lifetime of a single
component) is $R(100) = e^{-.03(100)} = e^{-3} = .0498$, while the corresponding reliability
for the parallel system is $R(100) = 1 - (1 - e^{-.01(100)})^3 = 1 - (1 - e^{-1})^3 = .7474$.


Fig. 4.21 Reliability functions for the series (solid) and parallel (dashed) systems of Example
4.47
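
The series and parallel formulas of the proposition amount to two products, which the following Python sketch implements for a list of component reliabilities evaluated at a single time t. The function names are our own, and the only inputs assumed are those of Example 4.47 (three independent exponential components with rate .01).

```python
import math

def series_reliability(component_reliabilities):
    """System reliability at time t for independent components in series."""
    return math.prod(component_reliabilities)

def parallel_reliability(component_reliabilities):
    """System reliability at time t for independent components in parallel."""
    return 1.0 - math.prod(1.0 - r for r in component_reliabilities)

def exponential_reliability(t, rate=0.01):
    """R(t) = exp(-rate * t) for a single exponential lifetime."""
    return math.exp(-rate * t)

# Three identical components evaluated at t = 100 h, as in Example 4.47
r_at_100 = [exponential_reliability(100.0)] * 3
series_100 = series_reliability(r_at_100)
parallel_100 = parallel_reliability(r_at_100)
print(round(series_100, 4), round(parallel_100, 4))
```

Because both functions accept any list of reliabilities, they also cover components with different lifetime distributions, provided independence still holds.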


Example 4.48 Consider the system depicted below, which consists of a combination
of series and parallel elements (as the first line of the calculation indicates,
components 1 and 2 are connected in series, and that pair is connected in parallel
with component 3). Using previous notation and assuming component
lifetimes are independent, let's determine the reliability function of this
system. More than one method may be applied here; we will rely on the Addition
Rule:

$$\begin{aligned} R(t) &= P([T_1 > t \cap T_2 > t] \cup T_3 > t) \\ &= P(T_1 > t \cap T_2 > t) + P(T_3 > t) - P(T_1 > t \cap T_2 > t \cap T_3 > t) && \text{Addition Rule} \\ &= P(T_1 > t)P(T_2 > t) + P(T_3 > t) - P(T_1 > t)P(T_2 > t)P(T_3 > t) && \text{independence} \\ &= R_1(t)R_2(t) + R_3(t) - R_1(t)R_2(t)R_3(t) \end{aligned}$$


It can be shown that this reliability function satisfies properties 1-3 (the first and
last are quite easy). If all three components have common reliability
function $R_i(t) = e^{-.01t}$ as in Example 4.47, the system reliability function becomes
$R(t) = e^{-.02t} + e^{-.01t} - e^{-.03t}$, which lies in between the two reliability functions of
Example 4.47 for all t > 0.

4.8.3 Mean Time to Failure 


If T denotes a system’s lifetime, i.e., its time until failure, then the mean time to 
failure (mttf) of the system is simply E(T). The following proposition relates mean 
time to failure to the reliability function. 


PROPOSITION 
Suppose a system has reliability function R(t) for t ≥ 0. Then the system's
mean time to failure is given by


$$\mu_T = \int_0^\infty [1 - F(t)]\,dt = \int_0^\infty R(t)\,dt \tag{4.8}$$


Expression (4.8) was established for all non-negative random variables in 
Exercises 38 and 150 of Chap. 3. 

As a simple demonstration of this proposition, consider a single exponential 
lifetime with mean 100 h. We have already seen that R(t) = e~°"" for this particular 
lifetime model; integrating the reliability function yields 


—.01t}%° 


lee) foe) 1 
R(j\dt=| ear = =0 = 100 
i 0 le =r; =01 


which is indeed the mean lifetime (aka mean time to failure) in this situation. The 
advantage of using Eq. (4.8) instead of the definition of E(T) from Chap. 3 is that 
the former is usually an easier integral to calculate than the latter. Here, for 
example, direct computation of the mean time to failure would be

$$E(T) = \int_0^\infty t \cdot f(t)\,dt = \int_0^\infty .01t\,e^{-.01t}\,dt,$$


which requires integration by parts (while the preceding computation did not). 
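For comparison, here is one way the direct computation can be carried out, written as a short LaTeX derivation. The choice of u and dv is ours; any standard integration by parts leads to the same value.

```latex
\begin{aligned}
E(T) &= \int_0^\infty t \cdot .01e^{-.01t}\,dt
      && \text{take } u = t,\ dv = .01e^{-.01t}\,dt,\ \text{so } du = dt,\ v = -e^{-.01t} \\
     &= \Big[-t\,e^{-.01t}\Big]_0^\infty + \int_0^\infty e^{-.01t}\,dt \\
     &= 0 + \frac{1}{.01} = 100
\end{aligned}
```

The second line contains exactly the integral of R(t), so the direct route does the same work as Eq. (4.8) plus an extra boundary term.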


Example 4.49 Consider again the series and parallel systems of Example 4.47. 
Using Eq. (4.8), the mttf of the series system is 


$$\mu_T = \int_0^\infty R(t)\,dt = \int_0^\infty e^{-.03t}\,dt = \frac{1}{.03} \approx 33.33 \text{ hours}$$


More generally, if n independent components are connected in series, and each
one has an exponentially distributed lifetime with common mean μ, then the
system's mean time to failure is μ/n.

In contrast, mttf for the parallel system is given by 


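$$
\begin{aligned}
\int_0^\infty R(t)\,dt &= \int_0^\infty \left(1 - (1 - e^{-.01t})^3\right)dt = \int_0^\infty \left(3e^{-.01t} - 3e^{-.02t} + e^{-.03t}\right)dt \\
&= \frac{3}{.01} - \frac{3}{.02} + \frac{1}{.03} = \frac{550}{3} \approx 183.33 \text{ hours}
\end{aligned}
$$


There is no simple formula for the mttf of a parallel system, even when the
components have identical exponential distributions. ■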
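When a reliability function is awkward to integrate by hand, Eq. (4.8) can be approximated numerically. The sketch below is written in Python and uses a plain trapezoid rule; the truncation point of 5,000 h and the helper names are our assumptions, not part of the example.

```python
import math

RATE = 0.01      # component failure rate per hour, as in Example 4.47
UPPER = 5000.0   # assumed truncation point; R(t) is negligible beyond it


def trapezoid(func, a, b, steps=200_000):
    """Composite trapezoid approximation to the integral of func over [a, b]."""
    width = (b - a) / steps
    total = 0.5 * (func(a) + func(b))
    for k in range(1, steps):
        total += func(a + k * width)
    return total * width


def series_reliability(t):
    return math.exp(-3 * RATE * t)


def parallel_reliability(t):
    return 1 - (1 - math.exp(-RATE * t)) ** 3


mttf_series = trapezoid(series_reliability, 0.0, UPPER)
mttf_parallel = trapezoid(parallel_reliability, 0.0, UPPER)
print(mttf_series, mttf_parallel)
```

The two printed values should agree closely with 1/.03 and 550/3 from the calculations in Example 4.49.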


4.8.4 Hazard Functions 


The reliability function of a system specifies the likelihood that the system will last
beyond a prescribed time, t. An alternative characterization of reliability, called the
hazard function, conveys information about the likelihood of imminent failure at
any time t.


DEFINITION 
Let T denote the lifetime of a system. If the rv T has pdf f(t) and cdf F(t), the
hazard function is defined by

$$h(t) = \frac{f(t)}{1 - F(t)}$$

If T has reliability function R(t), the hazard function may also be written as
$h(t) = f(t)/R(t)$.


Since the pdf f(t) is not a probability, neither is the hazard function h(t). To get a
sense of what the hazard function represents, consider the following question:
Given that the system has survived past time t, what is the probability the system
will fail within the next Δt time units (an imminent failure)? Such a probability may
be computed as follows:

$$P(T \le t + \Delta t \mid T > t) = \frac{P(t < T \le t + \Delta t)}{P(T > t)} = \frac{\int_t^{t+\Delta t} f(u)\,du}{R(t)} \approx \frac{f(t)\,\Delta t}{R(t)} = h(t) \cdot \Delta t$$

Rearranging, we have $h(t) \approx P(T \le t + \Delta t \mid T > t)/\Delta t$; more precisely, h(t) is the
limit of the right-hand side as Δt → 0. This suggests that the hazard function h(t) is a
density function, like f(t), except that h(t) relates to the conditional probability that
the system is about to fail.


Example 4.50 Once again, consider an exponentially distributed lifetime, T, but
with arbitrary parameter λ. The pdf and reliability function of T are $\lambda e^{-\lambda t}$ and $e^{-\lambda t}$,
respectively, so the hazard function of T is

$$h(t) = \frac{f(t)}{R(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda$$

In other words, a system whose time to failure follows an exponential distribution
will have a constant hazard function. (The converse is true, too; we'll see how to
recover f(t) from h(t) shortly.) This relates to the memoryless property of the
exponential distribution: given the system has functioned for t hours thus far, the
chance of surviving any additional amount of time is independent of t. As mentioned
in Sect. 3.4, this suggests the system does not "wear out" as time progresses (which
may be realistic for some devices in the short term, but not in the long term). ■


Example 4.51 Suppose instead that we model a system's lifetime with a Weibull
distribution. From the formulas presented in Sect. 3.5,

$$h(t) = \frac{f(t)}{1 - F(t)} = \frac{(\alpha/\beta^\alpha)\,t^{\alpha-1} e^{-(t/\beta)^\alpha}}{1 - [1 - e^{-(t/\beta)^\alpha}]} = \frac{\alpha}{\beta^\alpha}\,t^{\alpha-1}$$

For α = 1, this is identical to the exponential distribution hazard function (with
β = 1/λ). For α > 1, h(t) is an increasing function of t, meaning that we are more
likely to observe an imminent failure as time progresses (this is equivalent to the
system wearing out). For 0 < α < 1, h(t) decreases with t, which would suggest that
failures become less likely as t increases! This can actually be realistic for small
values of t: for many devices, manufacturing flaws cause a handful to fail very
early, and those that survive this initial "burn in" period are actually more likely to
survive a while longer (since they presumably don't have severe faults). ■

Fig. 4.22 A prototype hazard function h(t), with "burn in," "stable," and "burn out" regions along the time axis

Figure 4.22 shows a prototype hazard function, popularly called a “bathtub” 
shape. The function can be divided into three time intervals: (1) a “burn in” period 
of early failures due to manufacturing errors; (2) a “stable” period where failures 
are due primarily to chance; and (3) a “burn out” period with an increasing failure 
rate due to devices wearing out. In practice, most hazard functions exhibit one or 
more of these behaviors. 
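Each of the three behaviors can be mimicked by a Weibull hazard function with a suitable shape parameter. The Python sketch below computes the hazard both from the definition f(t)/[1 − F(t)] and from the simplified expression (α/β^α)t^(α−1); the function names are ours and the parameter values are only illustrative.

```python
import math


def weibull_pdf(t, alpha, beta):
    return (alpha / beta ** alpha) * t ** (alpha - 1) * math.exp(-((t / beta) ** alpha))


def weibull_cdf(t, alpha, beta):
    return 1 - math.exp(-((t / beta) ** alpha))


def hazard_from_definition(t, alpha, beta):
    """h(t) = f(t) / [1 - F(t)], valid for t > 0."""
    return weibull_pdf(t, alpha, beta) / (1 - weibull_cdf(t, alpha, beta))


def hazard_closed_form(t, alpha, beta):
    """Simplified Weibull hazard (alpha / beta^alpha) * t^(alpha - 1)."""
    return (alpha / beta ** alpha) * t ** (alpha - 1)


# alpha < 1: decreasing ("burn in"); alpha = 1: constant ("stable");
# alpha > 1: increasing ("burn out")
for alpha in (0.5, 1.0, 2.0):
    values = [hazard_closed_form(t, alpha, 1.0) for t in (0.5, 1.0, 2.0)]
    print(alpha, values)
```

A single Weibull model produces only one of these shapes at a time; a full bathtub curve needs a more flexible hazard function.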

There is a one-to-one correspondence between the pdf of a system lifetime, f(t),
and its hazard function, h(t). The definition of the hazard function shows how one 
may derive h(t) from f(t); the following proposition reverses the process. 


PROPOSITION 
Suppose a system has a continuous lifetime distribution on [0, ∞) with hazard
function h(t). Then its lifetime (aka time to failure) has reliability function
R(t) given by

$$R(t) = e^{-\int_0^t h(u)\,du}$$

and pdf f(t) given by

$$f(t) = h(t)\,e^{-\int_0^t h(u)\,du}$$


Proof Since R(t) = 1 − F(t), R′(t) = −f(t), and the hazard function may be written
as h(t) = −R′(t)/R(t). Now integrate both sides:

$$\int_0^t h(u)\,du = -\int_0^t \frac{R'(u)}{R(u)}\,du = -\ln[R(u)]\Big|_0^t = -\ln[R(t)] + \ln[R(0)]$$

Since the system lifetime is assumed to be continuous on [0, ∞), R(0) = 1 and
ln[R(0)] = 0. This leaves the equation

$$-\ln[R(t)] = \int_0^t h(u)\,du,$$

and the formula for R(t) follows by solving for R(t). The formula for f(t) follows
from the previous observation that R′(t) = −f(t), so f(t) = −R′(t), and then applying
the chain rule:

$$
\begin{aligned}
f(t) = -R'(t) &= -\frac{d}{dt}\left[e^{-\int_0^t h(u)\,du}\right] = -e^{-\int_0^t h(u)\,du} \cdot \frac{d}{dt}\left[-\int_0^t h(u)\,du\right] \\
&= e^{-\int_0^t h(u)\,du} \cdot h(t)
\end{aligned}
$$

The last step utilizes the Fundamental Theorem of Calculus. ■


The formulas for R(t) and f(t) in the preceding proposition can be easily modified
for the case where T = 0 with some positive probability, and so R(0) < 1 (see
Exercise 132).
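The proposition translates directly into a numerical recipe: integrate the hazard function, then exponentiate. Here is a minimal Python sketch that uses a midpoint rule for the integral; it assumes R(0) = 1, and the function names and step count are our own choices.

```python
import math


def reliability_from_hazard(hazard, t, steps=10_000):
    """R(t) = exp(-integral of hazard over [0, t]), assuming R(0) = 1."""
    if t == 0:
        return 1.0
    width = t / steps
    cumulative = sum(hazard((k + 0.5) * width) for k in range(steps)) * width
    return math.exp(-cumulative)


def pdf_from_hazard(hazard, t, steps=10_000):
    """f(t) = h(t) * R(t)."""
    return hazard(t) * reliability_from_hazard(hazard, t, steps)


# Constant hazard .01 should reproduce the exponential model with mean 100 h
print(reliability_from_hazard(lambda u: 0.01, 100.0), math.exp(-1))

# Hazard 2u is the Weibull case alpha = 2, beta = 1, for which R(t) = exp(-t^2)
print(reliability_from_hazard(lambda u: 2 * u, 1.5), math.exp(-2.25))
```

Each printed pair should agree to many decimal places, since the midpoint rule is exact for constant and linear hazards.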


Example 4.52 A certain type of high-quality transistor has hazard function
$h(t) = 1 + t^6$ for t ≥ 0, where t is measured in thousands of hours. This function is
illustrated in Fig. 4.23a; notice there is no "burn in" period, but we see a fairly
stable interval followed by burnout. The corresponding pdf for transistor
lifetimes is

$$f(t) = h(t)\,e^{-\int_0^t h(u)\,du} = (1 + t^6)\,e^{-\int_0^t (1 + u^6)\,du} = (1 + t^6)\,e^{-(t + t^7/7)}$$

This pdf appears in Fig. 4.23b. ■

Fig. 4.23 (a) Hazard function and (b) pdf for Example 4.52


4.8.5 Exercises: Section 4.8 (121-132) 


121. The lifetime, in thousands of hours, of the motor in a certain brand of kitchen
     blender has a Weibull distribution with α = 2 and β = 1.
     (a) Determine the reliability function of such a motor and then graph it.
     (b) What is the probability a motor of this type will last more than 1,500 h?
     (c) Determine the hazard function of such a motor and then graph it.
     (d) Find the mean time to failure of such a motor. Compare your answer to
         the expected value of a Weibull distribution given in Sect. 3.5. [Hint: Let
         $u = x^2$, and apply the gamma integral formula (3.5) to the resulting
         integral.]

122. High-speed Internet customers are often frustrated by modem crashes. Suppose
     the time to "failure" for one particular brand of cable modem, measured
     in hundreds of hours, follows a gamma distribution with α = β = 2.
     (a) Determine and graph the reliability function for this brand of cable
         modem.
     (b) What is the probability such a modem does not need to be refreshed for
         more than 300 h?
     (c) Find the mean time to "failure" for such a modem. Verify that your
         answer matches the formula for the mean of a gamma rv given in
         Sect. 3.4.
     (d) Determine and graph the hazard function for this type of modem.

123. Empirical evidence suggests that the electric ignition on a certain brand of
     gas stove has the following lifetime distribution, measured in thousands of
     days:

     $$f(t) = \begin{cases} .375t^2 & 0 \le t \le 2 \\ 0 & \text{otherwise} \end{cases}$$

     (Notice that the model indicates that all such ignitions expire within
     2,000 days, a little less than 6 years.)
     (a) Determine and graph the reliability function for this model, for all t ≥ 0.
     (b) Determine and graph the hazard function for 0 ≤ t < 2.
     (c) What happens to the hazard function for t > 2?

124. The manufacture of a certain children's toy involves an assembly line with
     five stations. The lifetimes of the equipment at these stations are independent
     and all exponentially distributed; the mean time to failure at the first three
     stations (in hundreds of hours) is 1.5, while the mttf at the last two stations is
     2.4.
     (a) Determine the reliability function for each of the five individual stations.
     (b) Determine the reliability function for the assembly line. [Hint: An
         assembly line is an example of what type of design?]
     (c) Find the mean time to failure for the assembly line.
     (d) Determine the hazard function for the assembly line.

125. A local bar owns four of the blenders described in Exercise 121, each having
     a Weibull(2, 1) lifetime distribution. During peak hours, these blenders are in
     continuous use, but the bartenders can keep making blended drinks
     (margaritas, etc.) provided that at least one of the four blenders is still
     functional. Define the "system" to be the four blenders under continuous
     use as described above, and define the lifetime of the system to be the length
     of time that at least one of the blenders is still functional. (Assume none of
     the blenders is replaced until all four have worn out.)
     (a) What sort of system design do we have in this example?
     (b) Find the reliability function of the system.
     (c) Find the hazard function of the system.
     (d) Find the mean time to failure of the system. [See the hint from Exercise
         121(d).]

126. Consider the six-component system displayed below. Let $R_1(t), \ldots, R_6(t)$
     denote the reliability functions of the components. Assume the six
     components operate independently.

     [System diagram: components 1, 2, 3 across the top row and components 4, 5, 6 across the bottom row]

     (a) Find the system reliability function.
     (b) Assuming all six components have exponentially distributed lifetimes
         with mean 100 h, find the mean time to failure for the system.

127. Consider the six-component system displayed below. Let $R_1(t), \ldots, R_6(t)$
     denote the component reliability functions. Assume the six components
     operate independently.

     [System diagram: components 1, 3, 5 across the top row and components 2, 4, 6 across the bottom row]

     (a) Find the system reliability function.
     (b) Assuming all six components have exponentially distributed lifetimes
         with mean 100 h, find the mean time to failure for the system.

128. A certain machine has the following hazard function:

     $$h(t) = \begin{cases} .002 & 0 < t \le 200 \\ .001 & t > 200 \end{cases}$$

     This corresponds to a situation where a device with an exponentially
     distributed lifetime is replaced after 200 h of operation by another, better
     device also having an exponential lifetime distribution.
     (a) Determine and graph the reliability function.
     (b) Determine the probability density function of the machine's lifetime.
     (c) Find the mean time to failure.

129. Suppose the hazard function of a device is given by

     $$h(t) = \begin{cases} \alpha(1 - t/\beta) & 0 \le t \le \beta \\ 0 & \text{otherwise} \end{cases}$$

     for some α, β > 0. This model asserts that if a device lasts β hours, it will last
     forever (while seemingly unreasonable, this model can be used to study just
     "initial wearout").

     (a) Find the reliability function.
     (b) Find the pdf of device lifetime.

130. Suppose n independent devices are connected in series and that the ith
     device has an exponential lifetime distribution with parameter $\lambda_i$.
     (a) Find the reliability function of the series system.
     (b) Show that the system lifetime also has an exponential distribution, and
         identify its parameter in terms of $\lambda_1, \ldots, \lambda_n$.
     (c) If the mean lifetimes of the individual devices are $\mu_1, \ldots, \mu_n$, find an
         expression for the mean lifetime of the series system.
     (d) If the same devices were connected in parallel, would the resulting
         system's lifetime also be exponentially distributed? How can you tell?

131. Show that a device whose hazard function is constant must have an exponential
     lifetime distribution.

132. Reconsider the drill bits described in Example 4.45, of which 5% shatter
     instantly (and so have lifetime T = 0). It was established that the reliability
     function for this scenario is $R(t) = .95e^{-.01t}$, t ≥ 0.
     (a) A generalized version of expected value that applies to distributions with
         both discrete and continuous elements can be used to show that the mean
         lifetime of these drill bits is (.05)(0) + (.95)(100) = 95 h. Verify that
         Eq. (4.8) applied to R(t) gives the same answer. [This suggests that our
         proposition about mttf can be used even when the lifetime distribution
         assigns positive probability to 0.]
     (b) For t > 0, the expression h(t) = −R′(t)/R(t) is still valid. Find the hazard
         function for t > 0.
     (c) Find a formula for R(t) in terms of h(t) that applies in situations where
         R(0) < 1. Verify that you recover $R(t) = .95e^{-.01t}$ when your formula is
         applied to h(t) from part (b). [Hint: Look at the earlier proposition in this
         section. What one change needs to occur to accommodate R(0) < 1?]


4.9 Order Statistics 


Many situations arise in practice that involve ordering sample observations from 
smallest to largest and then manipulating these ordered values in various ways. For 
example, once the bidding has closed in a hidden-bid auction (one in which bids are 
submitted independently of one another), the largest bid in the resulting sample is 
the amount paid for the item being auctioned, and the difference between the largest 
and second largest bids can be regarded as the amount that the successful bidder has 
overpaid. 

Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from a continuous distribution.
Because of continuity, for any i, j with i ≠ j, $P(X_i = X_j) = 0$. This implies that with
probability 1, the n sample observations will all be different (of course, in practice
all measuring instruments have accuracy limitations, so tied values may in fact
result).


DEFINITION 
The order statistics from a random sample are the random variables $Y_1, \ldots, Y_n$
given by

$Y_1$ = the smallest among $X_1, X_2, \ldots, X_n$ (i.e., the sample minimum)
$Y_2$ = the second smallest among $X_1, X_2, \ldots, X_n$
⋮
$Y_n$ = the largest among $X_1, X_2, \ldots, X_n$ (the sample maximum)

Thus, with probability 1, $Y_1 < Y_2 < \cdots < Y_{n-1} < Y_n$.


The sample median (i.e., the middle value in the ordered list) is then $Y_{(n+1)/2}$
when n is odd, while the sample range is $Y_n - Y_1$.


4.9.1 The Distributions of $Y_n$ and $Y_1$


The key idea in obtaining the distribution of the sample maximum $Y_n$ is the
observation that $Y_n$ is at most y if and only if every one of the $X_i$s is at most y.
Similarly, the distribution of $Y_1$ is based on the fact that it will exceed y if and only if
all $X_i$s exceed y.


Example 4.53 Consider 5 identical components connected in parallel as shown in
Fig. 4.20b. Let $X_i$ denote the lifetime, in hours, of the ith component (i = 1, 2, 3,
4, 5). Suppose that the $X_i$s are independent and that each has an exponential
distribution with λ = .01, so the expected lifetime of any particular component is
1/λ = 100 h. Because of the parallel configuration, the system will continue to
function as long as at least one component is still working, and will fail as soon
as the last component functioning ceases to do so. That is, the system lifetime is $Y_5$,
the largest order statistic in a sample of size 5 from the specified exponential
distribution. Now $Y_5$ will be at most y if and only if every one of the five $X_i$s is at
most y. With $G_5(y)$ denoting the cdf of $Y_5$,

$$
\begin{aligned}
G_5(y) = P(Y_5 \le y) &= P(X_1 \le y \cap X_2 \le y \cap \cdots \cap X_5 \le y) \\
&= P(X_1 \le y) \cdot P(X_2 \le y) \cdots P(X_5 \le y)
\end{aligned}
$$

For every one of the $X_i$s, $P(X_i \le y) = F(y) = \int_0^y .01e^{-.01x}\,dx = 1 - e^{-.01y}$; this is
the common cdf of the $X_i$s evaluated at y. Hence,
$G_5(y) = (1 - e^{-.01y}) \cdots (1 - e^{-.01y}) = (1 - e^{-.01y})^5$. The pdf of $Y_5$ can now be obtained by differentiating
the cdf with respect to y.

Suppose instead that the five components are connected in series rather than in
parallel (Fig. 4.20a). In this case the system lifetime will be $Y_1$, the smallest of the
five order statistics, since the system will crash as soon as a single one of the
individual components fails. Note that system lifetime will exceed y hours if and
only if the lifetime of every component exceeds y hours. Thus, the cdf of $Y_1$ is

$$
\begin{aligned}
G_1(y) = P(Y_1 \le y) &= 1 - P(Y_1 > y) = 1 - P(X_1 > y \cap X_2 > y \cap \cdots \cap X_5 > y) \\
&= 1 - P(X_1 > y) \cdot P(X_2 > y) \cdots P(X_5 > y) = 1 - (e^{-.01y})^5 = 1 - e^{-.05y}
\end{aligned}
$$

This is the form of an exponential cdf with parameter .05. More generally, if the
n components in a series connection have lifetimes that are independent, each
exponentially distributed with the same parameter λ, then the system lifetime will
be exponentially distributed with parameter nλ. We saw a similar result in Example
4.49. The expected system lifetime will then be 1/(nλ), much smaller than the
expected lifetime of an individual component. ■
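A quick simulation can be used to check both cdfs. The following Python sketch draws five independent exponential lifetimes with λ = .01 many times, records the smallest and largest, and compares empirical proportions with $G_1$ and $G_5$; the seed, the number of replications, and the function names are our own choices.

```python
import math
import random

RATE = 0.01        # lambda for each component, as in Example 4.53
N_COMPONENTS = 5


def simulate_extremes(reps=20_000, seed=1):
    """Return simulated values of Y1 (series lifetime) and Y5 (parallel lifetime)."""
    rng = random.Random(seed)
    minima, maxima = [], []
    for _ in range(reps):
        lifetimes = [rng.expovariate(RATE) for _ in range(N_COMPONENTS)]
        minima.append(min(lifetimes))
        maxima.append(max(lifetimes))
    return minima, maxima


def cdf_max(y):
    """G5(y) = (1 - e^{-.01 y})^5."""
    return (1 - math.exp(-RATE * y)) ** N_COMPONENTS


def cdf_min(y):
    """G1(y) = 1 - e^{-.05 y}."""
    return 1 - math.exp(-N_COMPONENTS * RATE * y)


minima, maxima = simulate_extremes()
prop_max_150 = sum(m <= 150.0 for m in maxima) / len(maxima)
prop_min_20 = sum(m <= 20.0 for m in minima) / len(minima)
mean_min = sum(minima) / len(minima)
print(prop_max_150, cdf_max(150.0))
print(prop_min_20, cdf_min(20.0))
print(mean_min, 1 / (N_COMPONENTS * RATE))
```

The empirical proportions should be close to the theoretical cdf values, and the average simulated series lifetime should be near 1/(5λ) = 20 h.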


An argument parallel to that of the previous example for a general sample size 
n and an arbitrary pdf f(x) gives the following general results. 


PROPOSITION 

Let $Y_1$ and $Y_n$ denote the smallest and largest order statistics, respectively,
based on a random sample from a continuous distribution with cdf F(x) and
pdf f(x). Then the cdf and pdf of $Y_n$ are

$$G_n(y) = [F(y)]^n, \qquad g_n(y) = n[F(y)]^{n-1} \cdot f(y)$$

The cdf and pdf of $Y_1$ are

$$G_1(y) = 1 - [1 - F(y)]^n, \qquad g_1(y) = n[1 - F(y)]^{n-1} \cdot f(y)$$


Example 4.54 Let X denote the contents of a one-gallon container, and suppose
that its pdf is f(x) = 2x for 0 ≤ x ≤ 1 (and 0 otherwise) with corresponding cdf
$F(x) = x^2$ in the interval of positive density. Consider a random sample of four
such containers. The order statistics $Y_1$ and $Y_4$ represent the contents of the least-filled
container and the most-filled container, respectively. The pdfs of $Y_1$ and $Y_4$ are

$$
\begin{aligned}
g_1(y) &= 4(1 - y^2)^3 \cdot 2y = 8y(1 - y^2)^3 && 0 \le y \le 1 \\
g_4(y) &= 4(y^2)^3 \cdot 2y = 8y^7 && 0 \le y \le 1
\end{aligned}
$$

The corresponding density curves appear in Fig. 4.24.

Fig. 4.24 Density curves for the order statistics (a) $Y_1$ and (b) $Y_4$ in Example 4.54


Let's determine the expected value of $Y_4 - Y_1$, the difference between the
contents of the most-filled container and the least-filled container; $Y_4 - Y_1$ is just
the sample range. Apply linearity of expectation:

$$
\begin{aligned}
E(Y_4 - Y_1) = E(Y_4) - E(Y_1) &= \int_0^1 y \cdot 8y^7\,dy - \int_0^1 y \cdot 8y(1 - y^2)^3\,dy \\
&= \frac{8}{9} - \frac{384}{945} = .889 - .406 = .483
\end{aligned}
$$

If random samples of four containers were repeatedly selected and the sample
range of contents determined for each one, the long run average value of the range
would be .483. ■
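Because both integrands are polynomials, the two expected values can be verified with exact rational arithmetic. The Python sketch below expands $y \cdot g_1(y) = 8y^2(1 - y^2)^3$ by hand and integrates term by term on [0, 1]; the helper name is ours.

```python
from fractions import Fraction


def integrate_polynomial_on_unit_interval(coefficients):
    """Integrate sum(c * y**k) over [0, 1]; coefficients maps power k to c."""
    return sum(Fraction(c, k + 1) for k, c in coefficients.items())


# y * g4(y) = 8 y^8
expected_max = integrate_polynomial_on_unit_interval({8: 8})

# y * g1(y) = 8 y^2 (1 - y^2)^3 = 8y^2 - 24y^4 + 24y^6 - 8y^8
expected_min = integrate_polynomial_on_unit_interval({2: 8, 4: -24, 6: 24, 8: -8})

expected_range = expected_max - expected_min
print(expected_max, expected_min, expected_range, float(expected_range))
```

The fraction reported for E($Y_1$) is in lowest terms, so it appears as 128/315 rather than 384/945; the two are equal.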


4.9.2 The Distribution of the ith Order Statistic


We have already obtained the (marginal) distribution of the largest order statistic $Y_n$
and also that of the smallest order statistic $Y_1$. A generalization of the argument
used previously results in the following proposition; Exercise 140 suggests how this
result can be derived.


PROPOSITION
Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a continuous distribution
with cdf F(x) and pdf f(x). The pdf of the ith smallest order statistic $Y_i$ is

$$g_i(y) = \frac{n!}{(i-1)!\,(n-i)!}\,[F(y)]^{i-1}\,[1 - F(y)]^{n-i}\,f(y) \qquad (4.9)$$


An intuitive justification for Expression (4.9) will be given shortly. Notice that it
is consistent with the pdf expressions for $g_1(y)$ and $g_n(y)$ given previously; just
substitute i = 1 and i = n, respectively.


Example 4.55 Suppose that component lifetime is exponentially distributed with
parameter λ. For a random sample of n = 5 components, the expected value of the
sample median lifetime is

$$E(Y_3) = \int_0^\infty y \cdot g_3(y)\,dy = \int_0^\infty y \cdot \frac{5!}{2!\,2!}\,(1 - e^{-\lambda y})^2 (e^{-\lambda y})^2 \cdot \lambda e^{-\lambda y}\,dy$$

Expanding out the integrand and integrating term by term, the expected value
is .783/λ. The median of the exponential distribution is, from solving F(η) = .5,
η = −ln(.5)/λ = .693/λ. Thus if sample after sample of five components is selected,
the long run average value of the sample median will be somewhat larger than the
median value of the individual lifetime distribution. This is because the exponential
distribution has a positive skew. ■
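The value .783/λ can be confirmed by integrating y·$g_3$(y) numerically. Here is a Python sketch that codes Expression (4.9) for a general pdf and cdf and then applies a midpoint rule; we take λ = 1 for convenience, and the truncation point of 60 and the function names are assumptions made only for this illustration.

```python
import math

LAM = 1.0  # with lambda = 1 the answer should be about .783


def exp_pdf(y):
    return LAM * math.exp(-LAM * y)


def exp_cdf(y):
    return 1 - math.exp(-LAM * y)


def order_stat_pdf(y, i, n, pdf, cdf):
    """Expression (4.9): pdf of the ith smallest of n observations."""
    coef = math.factorial(n) / (math.factorial(i - 1) * math.factorial(n - i))
    big_f = cdf(y)
    return coef * big_f ** (i - 1) * (1 - big_f) ** (n - i) * pdf(y)


def expected_order_stat(i, n, upper=60.0, steps=120_000):
    """Midpoint-rule approximation to the integral of y * g_i(y) over [0, upper]."""
    width = upper / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * width
        total += y * order_stat_pdf(y, i, n, exp_pdf, exp_cdf)
    return total * width


mean_sample_median = expected_order_stat(3, 5)
population_median = -math.log(0.5) / LAM
print(mean_sample_median, population_median)
```

The first printed value should be near .783 and the second near .693, in line with the comparison made in the example.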


Here is the promised intuitive derivation of Eq. (4.9). Let Δ be a number quite
close to 0, and consider the three intervals (−∞, y], (y, y + Δ], and (y + Δ, ∞). For a
single X, the probabilities of these three intervals are

$$
\begin{aligned}
p_1 &= P(X \le y) = F(y) \\
p_2 &= P(y < X \le y + \Delta) = \int_y^{y+\Delta} f(x)\,dx \approx f(y) \cdot \Delta \\
p_3 &= P(X > y + \Delta) = 1 - F(y + \Delta)
\end{aligned}
$$

For a random sample of size n, it is very unlikely that two or more Xs will fall in
the middle interval, since its width is only Δ. The probability that the ith order
statistic falls in the middle interval is then approximately the probability that i − 1 of
the Xs are in the first interval, one is in the middle, and the remaining n − i are in the
third. This is just a trinomial probability:

$$P(y < Y_i \le y + \Delta) \approx \frac{n!}{(i-1)!\,1!\,(n-i)!}\,[F(y)]^{i-1} \cdot f(y) \cdot \Delta \cdot [1 - F(y + \Delta)]^{n-i}$$

Dividing both sides by Δ and taking the limit as Δ → 0 gives exactly Expression
(4.9). That is, we may interpret the pdf $g_i(y)$ as loosely specifying that i − 1
of the original observations are below y, one is "at" y, and the other n − i are
above y.

Similar reasoning works to intuitively derive the joint pdf of $Y_i$ and $Y_j$ (i < j).
In this case there are five relevant intervals: $(-\infty, y_i]$, $(y_i, y_i + \Delta_1]$, $(y_i + \Delta_1, y_j]$,
$(y_j, y_j + \Delta_2]$, and $(y_j + \Delta_2, \infty)$.


4.9.3 The Joint Distribution of the n Order Statistics


We now develop the joint pdf of $Y_1, Y_2, \ldots, Y_n$. Consider first a random sample
$X_1, X_2, X_3$ of fuel efficiency measurements (mpg). The joint pdf of this random
sample is

$$f(x_1, x_2, x_3) = f(x_1) \cdot f(x_2) \cdot f(x_3)$$

The joint pdf of $Y_1, Y_2, Y_3$ will be positive only for values of $y_1, y_2, y_3$ satisfying
$y_1 < y_2 < y_3$. What is this joint pdf at the values $y_1 = 28.4$, $y_2 = 29.0$, $y_3 = 30.5$?
There are six different ways to obtain these ordered values:

$X_1 = 28.4$, $X_2 = 29.0$, $X_3 = 30.5$
$X_1 = 28.4$, $X_2 = 30.5$, $X_3 = 29.0$
$X_1 = 29.0$, $X_2 = 28.4$, $X_3 = 30.5$
$X_1 = 29.0$, $X_2 = 30.5$, $X_3 = 28.4$
$X_1 = 30.5$, $X_2 = 28.4$, $X_3 = 29.0$
$X_1 = 30.5$, $X_2 = 29.0$, $X_3 = 28.4$

These six possibilities come from the 3! ways to order the three numerical
observations once their values are fixed. Thus

$$
\begin{aligned}
g(28.4, 29.0, 30.5) &= f(28.4) \cdot f(29.0) \cdot f(30.5) + \cdots + f(30.5) \cdot f(29.0) \cdot f(28.4) \\
&= 3!\,f(28.4) \cdot f(29.0) \cdot f(30.5)
\end{aligned}
$$

Analogous reasoning with a sample of size n yields the following result:


PROPOSITION
Let $g(y_1, y_2, \ldots, y_n)$ denote the joint pdf of the order statistics $Y_1, Y_2, \ldots, Y_n$
resulting from a random sample of $X_i$s from a pdf f(x). Then

$$g(y_1, y_2, \ldots, y_n) = \begin{cases} n!\,f(y_1) \cdot f(y_2) \cdots f(y_n) & y_1 < y_2 < \cdots < y_n \\ 0 & \text{otherwise} \end{cases}$$

For example, if we have a random sample of component lifetimes and the
lifetime distribution is exponential with parameter λ, then the joint pdf of the
order statistics is

$$g(y_1, \ldots, y_n) = n!\,\lambda^n e^{-\lambda(y_1 + \cdots + y_n)} \qquad 0 < y_1 < y_2 < \cdots < y_n$$
Example 4.56 Suppose $X_1, X_2, X_3$, and $X_4$ are independent random variables, each
uniformly distributed on the interval from 0 to 1. The joint pdf of the four
corresponding order statistics $Y_1, Y_2, Y_3$, and $Y_4$ is $g(y_1, y_2, y_3, y_4) = 4! \cdot 1$ for
$0 < y_1 < y_2 < y_3 < y_4 < 1$. The probability that every pair of $X_i$s is separated by
more than .2 is the same as the probability that $Y_2 - Y_1 > .2$, $Y_3 - Y_2 > .2$, and
$Y_4 - Y_3 > .2$. This latter probability results from integrating the joint pdf of the $Y_i$s
over the region $.6 < y_4 < 1$, $.4 < y_3 < y_4 - .2$, $.2 < y_2 < y_3 - .2$, $0 < y_1 < y_2 - .2$:

$$P(Y_2 - Y_1 > .2,\ Y_3 - Y_2 > .2,\ Y_4 - Y_3 > .2) = \int_{.6}^{1} \int_{.4}^{y_4 - .2} \int_{.2}^{y_3 - .2} \int_{0}^{y_2 - .2} 4!\,dy_1\,dy_2\,dy_3\,dy_4$$

The inner integration gives $4!(y_2 - .2)$, and this must then be integrated between
.2 and $y_3 - .2$. Making the change of variable $z_2 = y_2 - .2$, the integration of $z_2$ is
from 0 to $y_3 - .4$. The result of this integration is $4! \cdot (y_3 - .4)^2/2$. Continuing with the
third and fourth integration, each time making an appropriate change of variable so
that the lower limit of each integration becomes 0, the result is

$$P(Y_2 - Y_1 > .2,\ Y_3 - Y_2 > .2,\ Y_4 - Y_3 > .2) = (.4)^4 = .0256$$

A more general multiple integration argument for n independent uniform [0, B]
rvs shows that the probability that all values are separated by more than some
distance d is

$$P(\text{all values are separated by more than } d) = \begin{cases} [1 - (n-1)d/B]^n & 0 \le d \le B/(n-1) \\ 0 & d > B/(n-1) \end{cases}$$

As an application, consider a year that has 365 days, and suppose that the birth
time of someone born in that year is uniformly distributed throughout the 365-day
period. Then in a group of 10 independently selected people born in that year, the
probability that all of their birth times are separated by more than 24 h (d = 1 day) is
$(1 - (10 - 1)(1)/365)^{10} = .779$. Thus the probability that at least two of the 10 birth
times are separated by at most 24 h is .221. As the group size n increases, it becomes
more likely that at least two people have birth times that are within 24 h of each
other (but not necessarily on the same day). For n = 16, this probability is .489, and
for n = 17 it is .533. So with as few as 17 people in the group, it is more likely than
not that at least two of the people were born within 24 h of each other.

Coincidences such as this are not as surprising as one might think. The probability
that at least two people are born on the same calendar day (assuming equally
likely birthdays) is much easier to calculate than what we have shown here; see the
Birthday Problem in Example 1.22. ■
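The separation formula is easy to tabulate. The short Python sketch below evaluates it for the four-uniform case of this example and for the birth-time application; the function name is ours, and the search for the smallest group size simply loops over n.

```python
def prob_all_separated(n, d, B):
    """P(all n independent uniform[0, B] values are more than d apart)."""
    if d > B / (n - 1):
        return 0.0
    return (1 - (n - 1) * d / B) ** n


# Four uniform(0, 1) values, every pair more than .2 apart
four_uniform = prob_all_separated(4, 0.2, 1.0)

# Birth times in a 365-day year, d = 1 day
within_24h = {n: 1 - prob_all_separated(n, 1, 365) for n in range(2, 31)}
smallest_n = min(n for n, p in within_24h.items() if p > 0.5)

print(four_uniform)
print(within_24h[10], within_24h[16], within_24h[17], smallest_n)
```

The dictionary `within_24h` holds, for each group size, the probability that at least two birth times fall within 24 h of each other.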


4.9.4 Exercises: Section 4.9 (133-142) 


133. A friend of ours takes the bus 5 days per week to her job. The five waiting
     times until she can board the bus are a random sample from a uniform
     distribution on the interval from 0 to 10 min.
     (a) Determine the pdf and then the expected value of the largest of the five
         waiting times.
     (b) Determine the expected value of the difference between the largest and
         smallest times.
     (c) What is the expected value of the sample median waiting time?
     (d) What is the standard deviation of the largest time?

134. Refer back to Example 4.54. Because n = 4, the sample median is the
     average of the two middle order statistics, $(Y_2 + Y_3)/2$. What is the expected
     value of the sample median, and how does it compare to the median of the
     population distribution?

135. An insurance policy issued to a boat owner has a deductible amount of
     $1000, so the amount of damage claimed must exceed this deductible before
     there will be a payout. Suppose the amount (thousands of dollars) of a
     randomly selected claim is a continuous rv with pdf $f(x) = 3/x^4$ for x > 1.
     Consider a random sample of three claims.
     (a) What is the probability that at least one of the claim amounts exceeds
         $5000?
     (b) What is the expected value of the largest amount claimed?

136. A store is expecting n deliveries between the hours of noon and 1 p.m.
     Suppose the arrival time of each delivery truck is uniformly distributed on
     this 1-h interval and that the times are independent of each other. What are
     the expected values of the ordered arrival times?

137. Let X be the amount of time an ATM is in use during a particular 1-h period,
     and suppose that X has the cdf $F(x) = x^\theta$ for 0 ≤ x ≤ 1 (where θ > 1). Give
     expressions involving the gamma function for both the mean and variance of
     the ith smallest amount of time $Y_i$ from a random sample of n such time
     periods.

138. The logistic pdf $f(x) = e^{-x}/(1 + e^{-x})^2$ for −∞ < x < ∞ is sometimes used to
     describe the distribution of measurement errors.
     (a) Graph the pdf. Does the appearance of the graph surprise you?
     (b) For a random sample of size n, obtain an expression involving the
         gamma function for the moment generating function of the ith smallest
         order statistic $Y_i$. This expression can then be differentiated to obtain
         moments of the order statistics. [Hint: Set up the appropriate integral,
         and then let $v = 1/(1 + e^{-x})$.]

139. Let X represent a measurement error. It is natural to assume that the pdf f(x)
     is symmetric about 0, so that the density at a value −c is the same as the
     density at c (an error of a given magnitude is equally likely to be positive or
     negative). Consider a random sample of n measurements, where n = 2k + 1,
     so that $Y_{k+1}$ is the sample median. What can be said about $E(Y_{k+1})$? If the
     X distribution is symmetric about some other value, so that value is the
     median of the distribution, what does this imply about $E(Y_{k+1})$? [Hints: For
     the first question, symmetry implies that 1 − F(x) = P(X > x) = P(X < −x) =
     F(−x). For the second question, consider W = X − η; what is the median of
     the distribution of W?]


140. The pdf of the second-largest order statistic, $Y_{n-1}$, can be obtained using
     reasoning analogous to how the pdf of $Y_n$ was first obtained.
     (a) For any number y, $Y_{n-1} \le y$ if and only if at least n − 1 of the original Xs
         are ≤ y. (Do you see why?) Use this fact to derive a formula for the cdf of
         $Y_{n-1}$ in terms of F, the cdf of the Xs. [Hint: Separate "at least n − 1" into
         two cases and apply the binomial distribution formula.]
     (b) Differentiate the cdf in part (a) to obtain the pdf of $Y_{n-1}$. Simplify and
         verify it matches the formula for $g_{n-1}(y)$ provided in this section.

141. Use the intuitive argument sketched in this section to obtain the following
     general formula for the joint pdf of two order statistics $Y_i$ and $Y_j$ with i < j:

     $$g(y_i, y_j) = \frac{n!}{(i-1)!\,(j-i-1)!\,(n-j)!}\,[F(y_i)]^{i-1}\,[F(y_j) - F(y_i)]^{j-i-1}\,[1 - F(y_j)]^{n-j}\,f(y_i)\,f(y_j) \quad \text{for } y_i < y_j$$

142. Consider a sample of size n = 3 from the standard normal distribution, and
     obtain the expected value of the largest order statistic. What does this say
     about the expected value of the largest order statistic in a sample of this size
     from any normal distribution? [Hint: With φ(x) denoting the standard normal
     pdf, use the fact that (d/dx)φ(x) = −xφ(x) along with integration by parts.]


4.10 Simulation of Joint Probability Distributions and System 
Reliability 


In Chaps. 2 and 3, we saw several methods for simulating “generic” discrete and 
continuous distributions (in addition to built-in functions for binomial, Poisson, 
normal, etc.). Unfortunately, most of these general methods do not carry over easily 
to joint distributions or else require significant re-tooling. In this section, we briefly 
survey some simulation techniques for general bivariate discrete and continuous 
distributions and discuss how to simulate normal distributions in more than one 
dimension. We then consider simulations for the lifetimes of interconnected 
systems, in order to understand the reliability of such systems. 


4.10.1 Simulating Values from a Joint PMF 


Simulating two dependent discrete rvs X and Y can be rather tedious and is easier to 
understand with a specific example in mind. Suppose we desire to simulate (X, Y) 
values from the joint pmf in Example 4.1: 
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y 
py) | 0 100 200 
100 20 10 20 
250 05 15 30 


The exhaustive search approach uses the inverse cdf method of Sect. 2.8 by 
reformatting the table as a single row of (x, y) pairs along with cumulative 
probabilities. Starting in the upper left corner and going across, create “cumulative” 
probabilities for the entire table: 


y 

| 0 100 200 

100 20 30 50 
250 55 70 1 


Be careful not to interpret these increasing decimals as cumulative probabilities 
in the traditional sense, e.g., it is not the case that .70 in the preceding table 
represents P(X < 250 M Y< 100). For ease of reading, this same table has been 
rendered below as two parallel rows, one enumerating the (x, y) values and another 
with the corresponding cumulative probabilities. 


(x, y)     | (100, 0)   (100, 100)   (100, 200)   (250, 0)   (250, 100)   (250, 200)
cum. prob. |   .20         .30          .50          .55         .70          1


Now the simulation proceeds similarly to those illustrated in Fig. 2.10 for 
simulating a single discrete random variable: use if-else statements, specifying 
the pair of values (x, y) for each range of standard uniform random numbers. 
Figure 4.25 provides the needed Matlab and R code. 

In both languages, executing the code in Fig. 4.25 results in two vectors x and y 
that, when regarded as paired values, form a simulation of the original joint pmf. 
That is to say, if x and y were laid in parallel, roughly 20% of the paired values 
would be (100, 0), about 10% would be (100, 100), and so on. 

At the end of Sect. 2.8 we mentioned that both Matlab and R have built-in 
functions to speed up the inverse cdf method (randsample and sample, 
respectively) for a single discrete rv. Unfortunately, these are not designed to take 
pairs of values as an input, and so the lengthier code is required. You might be 
tempted to use these built-in functions to simulate the (marginal) pmfs of X and Y 
separately, but beware: by design, the resulting simulated values of X and Y would be 
independent, and the rvs displayed in the original joint pmf are clearly dependent. 
For example, (100, 0) ought to appear roughly 20% of the time in a simulation; 
however, separate simulations of X and Y will result in about 50% 100s for X and 
25% 0s for Y, independently, meaning the pair (100, 0) will appear in approximately 
(.5)(.25) = 12.5% of simulated (X, Y) values. 
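
One possible workaround, sketched here in R, is to let sample draw a row index into the flattened table rather than a value, and then look up the corresponding (x, y) pair. The data frame name pmf and the variable names ending in .ind are illustrative choices, not taken from Fig. 4.25. The same sketch also simulates X and Y separately from their marginal pmfs, to show the independence pitfall numerically.

```r
# Joint pmf of Example 4.1, flattened row by row into six (x, y) pairs.
pmf <- data.frame(x = c(100, 100, 100, 250, 250, 250),
                  y = c(0, 100, 200, 0, 100, 200),
                  p = c(.20, .10, .20, .05, .15, .30))

set.seed(1)   # arbitrary seed, used only to make the run reproducible
n <- 10000

# Sample a row index with the joint probabilities, then look up the pair.
idx <- sample(nrow(pmf), n, replace = TRUE, prob = pmf$p)
x <- pmf$x[idx]
y <- pmf$y[idx]

# For contrast: sampling the two marginals separately forces independence.
px <- tapply(pmf$p, pmf$x, sum)   # marginal pmf of X: .50, .50
py <- tapply(pmf$p, pmf$y, sum)   # marginal pmf of Y: .25, .25, .50
x.ind <- sample(as.numeric(names(px)), n, replace = TRUE, prob = as.numeric(px))
y.ind <- sample(as.numeric(names(py)), n, replace = TRUE, prob = as.numeric(py))

joint.freq <- mean(x == 100 & y == 0)           # should be near .20
indep.freq <- mean(x.ind == 100 & y.ind == 0)   # should be near .125
c(joint = joint.freq, independent = indep.freq)
```

The first relative frequency should land near .20 and the second near .125, matching the calculation in the preceding paragraph.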

It’s worth noting that the choice to add across rows first was arbitrary. We could 
just as well have added down the left-most column (Y = 0) of the original joint pmf 
table, then the middle column, then the right column to create “cumulative” 
probabilities and then rewritten our code accordingly. 


a                                  b
x=zeros(10000,1); y=x;             x <- NULL; y <- NULL
for i=1:10000                      for (i in 1:10000){
  u=rand;                            u <- runif(1)
  if u<.2                            if (u<.2){
    x(i)=100; y(i)=0;                  x[i]<-100; y[i]<-0}
  elseif u<.3                        else if (u<.3){
    x(i)=100; y(i)=100;                x[i]<-100; y[i]<-100}
  elseif u<.5                        else if (u<.5){
    x(i)=100; y(i)=200;                x[i]<-100; y[i]<-200}
  elseif u<.55                       else if (u<.55){
    x(i)=250; y(i)=0;                  x[i]<-250; y[i]<-0}
  elseif u<.7                        else if (u<.7){
    x(i)=250; y(i)=100;                x[i]<-250; y[i]<-100}
  else                               else{
    x(i)=250; y(i)=200;                x[i]<-250; y[i]<-200}
  end                              }
end


Fig. 4.25 The exhaustive search method for simulating two discrete rvs: (a) Matlab; (b) R 


4.10.2 Simulating Values from a Joint PDF 


As in the discrete case, a pair of independent continuous rvs X and Y can be 
simulated separately using any of the methods from Sect. 3.8 (inverse cdf, 
accept-reject). In the general case, however, the inverse cdf method breaks down in 
two or more dimensions, because we cannot “invert” the joint cdf of X and Y. Hence, 
we rely primarily on the accept-reject method. The following proposition repeats the 
algorithm from Sect. 3.8 but expands it to two dimensions; a simulation scheme for 
three or more dependent rvs would be analogous. 


ACCEPT-REJECT METHOD (bivariate case) 

It is desired to simulate n values from a joint pdf f(x, y). Let g1(x) and g2(y) be 
two univariate pdfs such that the ratio f/[g1g2] is bounded, i.e., there exists a 
constant c such that f(x, y)/[g1(x)g2(y)] ≤ c for all x and y. Proceed as follows: 

1. Generate a variate, x*, from the distribution g1; independently, generate a 
   variate, y*, from the distribution g2. This pair (x*, y*) is our candidate. 

2. Generate a standard uniform variate, u. 

3. If u · c · g1(x*)g2(y*) ≤ f(x*, y*), then assign (x, y) = (x*, y*), i.e., “accept” 
   the candidate. Otherwise, reject (x*, y*) and return to step 1. 

These steps are repeated until n candidate pairs have been accepted. The 
resulting accepted pairs (x1, y1), ..., (xn, yn) constitute a simulation of a pair of 
random variables (X, Y) with the original joint pdf, f(x, y). 

a                                      b
x=zeros(10000,1); y=x;                 x <- NULL; y <- NULL
i=0;                                   i <- 0
while i<10000                          while (i<10000){
  xstar=random('unif',20,30);            xstar <- runif(1,20,30)
  ystar=random('unif',20,30);            ystar <- runif(1,20,30)
  u=rand;                                u <- runif(1)
  if u<=(xstar^2+ystar^2)/1800           if (u<=(xstar^2+ystar^2)/1800){
    i=i+1;                                 i <- i+1
    x(i)=xstar;                            x[i] <- xstar
    y(i)=ystar;                            y[i] <- ystar
  end                                    }
end                                    }


Fig. 4.26 Simulation code for Example 4.57: (a) Matlab; (b) R 


As in the one-dimensional case, the accept-reject method hinges on generating 
some other distribution on the same set of values as the “target” pdf. What’s special 
here is that we leverage our ability to simulate univariate distributions, namely 
g1(x) and g2(y), and create a candidate pair (x*, y*) from two independent rvs. In 
particular, the product g1(x*)g2(y*) that appears in the algorithm is the joint pdf of 
two independent rvs having marginal distributions g1 and g2, respectively. 
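
The three steps of the bivariate accept-reject method can be sketched as a small general-purpose R function. Everything named here (accept_reject_2d and its arguments r1, r2, d1, d2, cmax) is hypothetical and introduced only for illustration. The sketch assumes the caller supplies candidate generators and densities for g1 and g2 plus a valid bound; if the target is known only up to a constant k, the same k can be dropped from both f and the bound. The usage lines anticipate the pdf k(x² + y²) on [20, 30] × [20, 30] with uniform candidates, for which the unnormalized bound is 180,000.

```r
# n:      number of accepted pairs wanted
# f:      target joint pdf (possibly unnormalized, if cmax is scaled the same way)
# r1, r2: functions generating m candidates from g1 and g2
# d1, d2: the densities g1 and g2
# cmax:   a constant with f(x, y) / (g1(x) g2(y)) <= cmax everywhere
accept_reject_2d <- function(n, f, r1, r2, d1, d2, cmax) {
  x <- numeric(n); y <- numeric(n)
  accepted <- 0; tries <- 0
  while (accepted < n) {
    xs <- r1(1); ys <- r2(1)      # step 1: independent candidates
    u <- runif(1)                 # step 2: standard uniform variate
    tries <- tries + 1
    if (u * cmax * d1(xs) * d2(ys) <= f(xs, ys)) {   # step 3: accept?
      accepted <- accepted + 1
      x[accepted] <- xs
      y[accepted] <- ys
    }
  }
  list(x = x, y = y, tries = tries)
}

# Usage with the constant k dropped from both f and the bound.
set.seed(457)   # arbitrary seed for reproducibility
h  <- function(x, y) x^2 + y^2
rU <- function(m) runif(m, 20, 30)
dU <- function(x) dunif(x, 20, 30)
sim <- accept_reject_2d(10000, h, rU, rU, dU, dU, cmax = 180000)
sim$tries / 10000   # average number of candidates per accepted pair
```

Returning the number of candidates tried makes it easy to compare the observed cost of the method with the majorization constant.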


Example 4.57 It is desired to simulate values from the following joint pdf, 
introduced in Exercise 11: 

$$f(x, y) = \begin{cases} k(x^2 + y^2) & 20 \le x \le 30,\ 20 \le y \le 30 \\ 0 & \text{otherwise} \end{cases}$$

(Determining the normalizing constant, k, won’t be necessary.) Since both 
X and Y are bounded between 20 and 30, a sensible choice for both g1 and g2 is the 
uniform distribution on [20, 30]. That is, g1(x) = 1/(30 − 20) = .1 for 20 ≤ x ≤ 30, and 
g2 is the same pdf: g2(y) = .1 for 20 ≤ y ≤ 30. The majorization constant c is 
determined by requiring 

$$\frac{f(x, y)}{g_1(x)\,g_2(y)} = \frac{k(x^2 + y^2)}{(.1)(.1)} \le c \quad \text{for } 20 \le x \le 30,\ 20 \le y \le 30$$

The left-hand expression is obviously maximized at x = y = 30, from which we 
have c ≥ k(30² + 30²)/(.1)² = 180,000k. Setting c = 180,000k, the accept-reject 
scheme for this joint pdf proceeds as follows: 

1. Generate independent x* ~ Unif[20, 30] and y* ~ Unif[20, 30]. 
2. Generate a standard uniform variate, u. 
3. Accept (x*, y*) iff u · c · g1(x*)g2(y*) ≤ f(x*, y*), i.e., u · 180,000k · (.1)(.1) ≤ 
   k((x*)² + (y*)²). This is algebraically equivalent to u ≤ ((x*)² + (y*)²)/1800. 

Figure 4.26 provides Matlab and R code for this example. The output of either one 
is a pair of vectors, x and y, whose paired values simulate the original joint pdf. 
Figure 4.27 shows the joint pdf f(x, y) alongside a “three-dimensional histogram” 
of 10,000 (x, y) values simulated in Matlab (the latter was created using the hist3 
command). Observe that both show a slight rise as the x- or y-values increase from 
20 to 30. 


Fig. 4.27 Joint pdf (a) and histogram of simulated values (b) for Example 4.57 


As indicated in Sect. 3.8, it can be shown that the majorization constant c is also 
the expected number of candidates required to generate a single accepted value 
(here, a pair). In the preceding example, the numerical value of c turns out to be 
c = 27/19 ≈ 1.421, so we expect our programs to require about 14,210 iterations of 
the while loop to create 10,000 simulated values of (X, Y). 
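
The value c = 27/19 can be checked by hand. The following short derivation is a sketch that uses only the joint pdf and the uniform candidate densities of Example 4.57:

```latex
\begin{aligned}
\int_{20}^{30}\!\int_{20}^{30} (x^2 + y^2)\,dx\,dy
  &= 2 \cdot 10 \int_{20}^{30} x^2\,dx
   = 20 \cdot \frac{30^3 - 20^3}{3}
   = \frac{380{,}000}{3}
   \quad\Longrightarrow\quad k = \frac{3}{380{,}000}, \\[4pt]
c &= 180{,}000\,k = \frac{540{,}000}{380{,}000} = \frac{27}{19} \approx 1.421 .
\end{aligned}
```

Equivalently, each candidate pair is accepted with probability 1/c = 19/27 ≈ .704.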

As an alternative to the accept-reject method, a technique based on conditional 
distributions can be employed. The basic idea is this: suppose X has pdf f(x) and, 
conditional on X = x, Y has conditional distribution f(y|x). Then one can simulate 
(X, Y) by first simulating from f(x) using the techniques of Sect. 3.8 and then, given 
the simulated value of x, simulating a value y from f(y|x). 


Example 4.58 Consider the following joint pdf, introduced in Exercise 14: 

$$f(x, y) = \begin{cases} x e^{-x(1+y)} & x > 0 \text{ and } y > 0 \\ 0 & \text{otherwise} \end{cases}$$

Straightforward integration shows the marginal pdf of X to be fX(x) = e^(−x), from 
which the conditional distribution of Y given X = x is 

$$f(y \mid x) = \frac{f(x, y)}{f_X(x)} = \frac{x e^{-x(1+y)}}{e^{-x}} = x e^{-xy}$$

Each of these has an algebraically simple cdf, so we will employ the inverse 
cdf method for each step. The cdf of X is F(x) = 1 − e^(−x), whose inverse is given by 
x = −ln(1 − u). Similarly, the conditional cdf of Y given X = x is F(y|x) = 1 − e^(−xy), 
whose inverse function (with respect to y) is y = −(1/x)ln(1 − u). The resulting 
simulation code, in Matlab and R, appears in Fig. 4.28. Notice in each program that 
two standard uniform variates, u and v, are required: one to simulate x, and another 
to simulate y given x. 


a                                  b
x=zeros(10000,1); y=x;             x <- NULL; y <- NULL
for i=1:10000                      for (i in 1:10000){
  u=rand;                            u <- runif(1)
  x(i)=-log(1-u);                    x[i] <- -log(1-u)
  v=rand;                            v <- runif(1)
  y(i)=-(1/x(i))*log(1-v);           y[i] <- -(1/x[i])*log(1-v)
end                                }


Fig. 4.28 Simulation code for Example 4.58: (a) Matlab; (b) R 

Some simplifications can be made to the preceding code. As in many other 
simulations, the for loop can be vectorized (summoning all 10,000 simulated values 
at once). Additionally, you might recognize the pdfs under consideration: the 
marginal distribution of X is exponential with λ = 1, while Y given X = x is 
exponential with parameter λ = x. Hence, we could exploit Matlab’s or R’s built-in 
exponential distribution simulator, rather than finding and inverting the cdfs. ■ 
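
Both simplifications can be combined in a few lines of R. This is a minimal sketch rather than the book's own listing; it relies on the fact that the rate argument of rexp is vectorized, so each simulated y can use its own x as the rate. The seed value is arbitrary.

```r
set.seed(458)   # arbitrary seed for reproducibility
n <- 10000

x <- rexp(n, rate = 1)   # marginal of X: exponential with lambda = 1
y <- rexp(n, rate = x)   # Y given X = x: exponential with lambda = x, one rate per pair

mean(x)        # should be near E(X) = 1
mean(y <= 1)   # should be near P(Y <= 1) = 1/2 (see the note below)
```

As a sanity check, integrating the joint pdf over x gives the marginal pdf of Y as 1/(1 + y)² for y > 0, so P(Y ≤ 1) = 1/2. The sample mean of the simulated y values is not a useful check here, because E(Y) = E(1/X) is infinite for this model.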


This method can also be extended to three or more variables, but finding the 
required conditional pdfs from the joint pdf can be difficult. This conditional 
distributions method is best suited to so-called hierarchical models, where the 
distribution of each rv is specified conditional on its predecessors, e.g., we are 
provided initially with f(x), f(y|x), f(z|y, x), and so on. 

The conditional distributions approach may also be implemented to simulate a 
joint discrete distribution; see Exercise 149. 


4.10.3 Simulating a Bivariate Normal Distribution 


The prevalence of normal distributions makes the ability to simulate both univariate 
and multivariate normal rvs especially important. A simple method exists for 
simulating pairs from an arbitrary bivariate normal distribution, as indicated in 
the following proposition. 


PROPOSITION 
Let Z1 and Z2 be independent standard normal rvs and let 

$$W_1 = Z_1, \qquad W_2 = \rho \cdot Z_1 + \sqrt{1 - \rho^2}\, Z_2$$

Then W1 and W2 have a bivariate normal distribution, each having mean 
0 and standard deviation 1, and Corr(W1, W2) = ρ. 

a                                          b
z1=random('norm',0,1,[10000 1]);           z1 <- rnorm(10000)
z2=random('norm',0,1,[10000 1]);           z2 <- rnorm(10000)
x=496+114*z1;                              x <- 496+114*z1
y=514+117*(.25*z1+sqrt(1-.25^2)*z2);       y <- 514+117*(.25*z1+sqrt(1-.25^2)*z2)


Fig. 4.29 Code for Example 4.59: (a) Matlab; (b) R 


This result can be proved using the transformation methods of Sect. 4.6. The 
means, variances, and correlation coefficient of W1 and W2 are established in 
Exercise 161. 

Now suppose we wish to simulate from a bivariate normal distribution with an 
arbitrary set of parameters μ1, σ1, μ2, σ2, and ρ. Define X and Y by 

$$X = \mu_1 + \sigma_1 W_1 = \mu_1 + \sigma_1 Z_1, \qquad Y = \mu_2 + \sigma_2 W_2 = \mu_2 + \sigma_2\left(\rho Z_1 + \sqrt{1 - \rho^2}\, Z_2\right) \tag{4.10}$$

Since X and Y in Expression (4.10) are just linear functions of W1 and W2, it 
follows from Sect. 4.2 that Corr(X, Y) = Corr(W1, W2) = ρ. Moreover, since W1 and 
W2 have mean zero and standard deviation 1, these linear transformations give 
X and Y the desired means and standard deviations. So, to simulate a bivariate 
normal distribution, create a pair of independent standard normal variates z1 and z2, 
and then apply the formulas for X and Y in Eq. (4.10). 
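
Expression (4.10) translates directly into a short, vectorized R function. The function name rbvnorm and the parameter values in the usage lines are hypothetical, chosen only to illustrate the recipe; the sketch assumes −1 < ρ < 1 and positive standard deviations.

```r
# Simulate n pairs from a bivariate normal distribution via Expression (4.10).
rbvnorm <- function(n, mu1, sigma1, mu2, sigma2, rho) {
  z1 <- rnorm(n)   # independent standard normal variates
  z2 <- rnorm(n)
  x <- mu1 + sigma1 * z1
  y <- mu2 + sigma2 * (rho * z1 + sqrt(1 - rho^2) * z2)
  cbind(x = x, y = y)   # one simulated (X, Y) pair per row
}

set.seed(410)   # arbitrary seed for reproducibility
xy <- rbvnorm(10000, mu1 = 10, sigma1 = 2, mu2 = -5, sigma2 = 3, rho = -.6)

colMeans(xy)                # should be near (10, -5)
apply(xy, 2, sd)            # should be near (2, 3)
cor(xy[, "x"], xy[, "y"])   # should be near -.6
```

Comparing the sample means, standard deviations, and correlation with the inputs gives a quick check that the transformation behaves as claimed.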


Example 4.59 Consider the joint distribution of SAT reading and math scores 
described in Example 4.42. Using the parameters from that example, Eq. (4.10) 
becomes 


$$X = 496 + 114 Z_1, \qquad Y = 514 + 117\left(.25 Z_1 + \sqrt{1 - .25^2}\, Z_2\right)$$


Figure 4.29 shows this transformation implemented in Matlab and R; both 
programs have been vectorized and produce 10,000 (X, Y) pairs. 

Now define a new rv R = Y/X, the ratio of a student’s SAT Math and Critical 
Reading scores. Arguably, this measures a student’s math ability relative to her or 
his reading skills. Determining the pdf of R is simply not feasible, especially since 
X and Y are dependent. But the above simulation, along with the command r=y/x 
(in R, or r=y./x in Matlab) gives us information about its distribution. A 
histogram of the simulated values of R appears in Fig. 4.30. For these 10,000 
simulated values, the sample mean and standard deviation are r̄ = 1.0161 and 
s = 0.1677. So, we estimate that the true expected ratio E(R) for all students who 
took the SAT in Fall 2012 is 1.0161, with an estimated standard error of 

$$s/\sqrt{n} = 0.1677/\sqrt{10{,}000} = .001677$$


Fig. 4.30 Histogram of R = Y/X from Example 4.59 
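
The estimate of E(R) and its estimated standard error take only a few extra lines once x and y have been simulated. The sketch below repeats the simulation of Fig. 4.29 so that it is self-contained; the names r.bar and se are illustrative, the seed is arbitrary, and the numerical results will differ from run to run and from the values quoted above.

```r
set.seed(459)   # arbitrary seed for reproducibility
n  <- 10000

# Simulate (X, Y) as in Fig. 4.29.
z1 <- rnorm(n)
z2 <- rnorm(n)
x  <- 496 + 114 * z1
y  <- 514 + 117 * (.25 * z1 + sqrt(1 - .25^2) * z2)

r     <- y / x               # simulated values of R = Y/X
r.bar <- mean(r)             # point estimate of E(R)
se    <- sd(r) / sqrt(n)     # estimated standard error, s / sqrt(n)

c(estimate = r.bar, std.error = se)
hist(r)                      # histogram of the simulated ratios
```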


Matlab and R also have built-in programs to simulate multivariate normal 
distributions that work for 2 or more dimensions and do not rely on the preceding 
proposition. In fact, users have created several such tools in R; we illustrate here the 
function available in the mvtnorm package. The mvnrnd function in Matlab and 
the rmvnorm function in R take three inputs: the desired number of simulated 
values (in the bivariate case, simulated pairs), a vector of means, and a covariance 
matrix (see the end of Sect. 4.7). Figure 4.31 illustrates these commands for the 
distribution specified in Example 4.59. 


a                                  b
                                   library(mvtnorm)
mu=[496, 514];                     mu <- c(496,514)
C=[114^2, .25*114*117;             C <- matrix(c(114^2, .25*114*117,
   .25*114*117, 117^2];                 .25*114*117, 117^2),2,2)
x=mvnrnd(mu,C,10000)               x <- rmvnorm(10000,mu,C)


Fig. 4.31 Built-in multivariate normal simulations: (a) Matlab; (b) R 
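
When an add-on package is not available, the same inputs (a mean vector and a covariance matrix) can be handled in base R with a Cholesky factorization. This is a different technique from the one in the proposition, and the sketch makes no claim about how mvnrnd or rmvnorm work internally. It assumes the covariance matrix C is positive definite, and it extends to more than two dimensions by changing the number of columns.

```r
set.seed(431)   # arbitrary seed for reproducibility
n  <- 10000
mu <- c(496, 514)
C  <- matrix(c(114^2, .25*114*117,
               .25*114*117, 117^2), 2, 2)

U <- chol(C)                        # upper triangular factor, t(U) %*% U equals C
Z <- matrix(rnorm(n * 2), n, 2)     # n rows of independent standard normals
X <- sweep(Z %*% U, 2, mu, "+")     # each row is one simulated (X, Y) pair

colMeans(X)   # should be near mu
cov(X)        # should be near C
```

Each row of Z %*% U has covariance matrix t(U) %*% U = C, and adding mu to every row supplies the desired means.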


4.10.4 Simulation Methods for Reliability 


One area of application for the simulation methods presented in this section is to the 
lifetime distributions of complex systems. It can sometimes be difficult to derive the 
exact pdf of the lifetime of a system composed of many components (in series 
and/or parallel), but simulation provides a way out. 


Example 4.60 Consider the system described in Example 4.48; this is actually a 
comparatively simple configuration. Let T1, T2, and T3 denote the lifetimes of the 
three components. Since components 1 and 2 are connected in series, the 
“1-2 subsystem” functions only as long as the smaller of T1 and T2, e.g., if 
T1 = 135 h and T2 = 119 h, then the lifetime of the 1-2 subsystem is 119 h. 
The lifetime of the 1-2 subsystem, therefore, can be expressed mathematically as 
min(T1, T2). 

Similarly, the 1-2 subsystem is linked in parallel with component 3, and so 
the lifetime of the overall system is the larger of the lifetimes of the two pieces 
(the 1-2 subsystem and component 3). For example, if the lifetime of the 1-2 
subsystem is 119 h and the lifetime of component 3 is 127 h, then the overall system 
lifetime is 127 h. If we let Tsys denote the system lifetime, then we have 

$$T_{\text{sys}} = \max(\text{1-2 subsystem lifetime},\ \text{component 3 lifetime}) = \max(\min(T_1, T_2),\ T_3)$$


This expression combining max and min functions can be used to simulate the 
system lifetime, provided we have models (that we can simulate) for the lifetimes of 
the three individual components. Figure 4.32 shows example code for simulating 
the system lifetime assuming each of the three components has an exponentially 
distributed lifetime with mean 100 h. The 10,000 simulation runs have been 
vectorized to accelerate the process. Notice that the R code requires using the 
pmax and pmin commands, which treat their inputs as parallel vectors and find the 
“row-wise” maximum or minimum. 


a                                    b
T1=-100*log(1-rand(10000,1));        T1 <- -100*log(1-runif(10000))
T2=-100*log(1-rand(10000,1));        T2 <- -100*log(1-runif(10000))
T3=-100*log(1-rand(10000,1));        T3 <- -100*log(1-runif(10000))
Tsys=max(min(T1,T2),T3);             Tsys <- pmax(pmin(T1,T2),T3)


Fig. 4.32 Simulation code for Example 4.60: (a) Matlab; (b) R 
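
Larger systems can be handled by composing the same two operations. The sketch below wraps pmin and pmax in two helper functions, series and parallel (hypothetical names, not from the text), and uses R's built-in rexp in place of the inverse cdf formula. It assumes, as in Fig. 4.32, independent exponential lifetimes with mean 100 h; the seed is arbitrary.

```r
# A series arrangement fails at its first component failure;
# a parallel arrangement fails at its last component failure.
series   <- function(...) pmin(...)
parallel <- function(...) pmax(...)

set.seed(460)   # arbitrary seed for reproducibility
n  <- 10000
T1 <- rexp(n, rate = 1/100)   # exponential lifetimes with mean 100 h
T2 <- rexp(n, rate = 1/100)
T3 <- rexp(n, rate = 1/100)

# Components 1 and 2 in series, that subsystem in parallel with component 3.
Tsys <- parallel(series(T1, T2), T3)

est  <- mean(Tsys)            # estimated mean system lifetime
se   <- sd(Tsys) / sqrt(n)    # its estimated standard error
p200 <- mean(Tsys > 200)      # estimated P(Tsys > 200)
c(mean = est, std.error = se, p.over.200 = p200)
```

Writing the system as nested calls mirrors the block diagram, which makes it easier to adapt the code when components are added or rearranged.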


A histogram of 10,000 simulated values of Tsys from R appears in Fig. 4.33; this 
should look similar to the pdf of that rv. We can use these same simulated values to 
estimate the expectation and standard deviation of Tsys: for our run, the sample 
mean and standard deviation were 115.37 h and 93.49 h, respectively. 


Fig. 4.33 A histogram of simulated values of Tsys in Example 4.60 
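
For the particular model simulated in Fig. 4.32, the sample mean can be compared with an exact value. The derivation below is a sketch that assumes the three lifetimes are independent exponential rvs with mean 100 h (as in the simulation code). It uses the identity max(a, b) = a + b − min(a, b), together with the fact that the minimum of independent exponential rvs is again exponential, with rate equal to the sum of the individual rates:

```latex
\begin{aligned}
E[\min(T_1, T_2)] &= \frac{1}{2/100} = 50, \qquad
E[\min(T_1, T_2, T_3)] = \frac{1}{3/100} = \frac{100}{3}, \\[4pt]
E[T_{\text{sys}}] &= E[\min(T_1, T_2)] + E[T_3] - E[\min(T_1, T_2, T_3)]
  = 50 + 100 - \frac{100}{3} = \frac{350}{3} \approx 116.67 \text{ h}.
\end{aligned}
```

The reported sample mean of 115.37 h lies within about 1.4 estimated standard errors (93.49/√10,000 ≈ 0.93 h) of this value.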


4.10.5 Exercises: Section 4.10 (143-153) 


143. Consider the service station scenario presented in Exercise 1 of this chapter. 

(a) Write a program to simulate the rvs (X, Y) described in that exercise. 

(b) Use your program to estimate P(X ≤ 1 and Y ≤ 1), and compare your 
estimate to the exact answer from the joint pmf. Use at least 10,000 
simulation runs. 

(c) Define a new variable D = |X − Y|, the (absolute) difference in the 
number of hoses in use at the two gas pumps. Use your program (with 
at least 10,000 runs) to simulate D, and estimate both the mean and 
standard deviation of D. 

144. Refer back to the quiz scenario of Exercise 24. 

(a) Write a program to simulate students’ scores (X, Y) on the two parts of 
the quiz. 

(b) Use your program to estimate the probability that a student’s total score 
is at least 20 points. How does your estimate compare to the exact 
answer from the joint pmf? 
(c) Define a new rv M = the maximum of the two scores. Use your program 
to simulate M, and estimate both the mean and standard deviation of M. 

145. Consider the situation presented in Example 4.13: the joint pdf of the 
amounts X and Y of almonds and cashews, respectively, in a 1-lb can of 
nuts is 

$$f(x, y) = \begin{cases} 24xy & 0 \le x \le 1,\ 0 \le y \le 1,\ x + y \le 1 \\ 0 & \text{otherwise} \end{cases}$$

With the prices specified in that example, the total cost of the contents of 
one can is W = 3.5 + 2.5X + 6.5Y. 

(a) Write a program implementing the accept-reject method of this section 
to simulate (X, Y). 

(b) On the average, how many iterations will your program require to 
generate 10,000 “accepted” (X, Y) pairs? 

(c) Use your program to simulate the rv W. Create a histogram of the 
simulated values of W, and report estimates of the mean and standard 
deviation of W. How close is your sample mean to the value E(W) = 
$7.10 determined in Example 4.13? 

(d) Use your simulation in part (c) to estimate the probability that the cost of 
the contents of a can of nuts exceeds $8. 

146. Suppose a randomly chosen individual’s verbal score X and quantitative 
score Y on a nationally administered aptitude examination, each scaled down 
to [0, 1], have joint pdf 

$$f(x, y) = \begin{cases} \frac{2}{5}(2x + 3y) & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$

(a) Write a program implementing the accept-reject method of this section 
to simulate (X, Y). 

(b) The engineering school at a certain university uses a weighted total 
T = 3X + 7Y as part of its admission process. Use your program in part 
(a) to simulate the rv T, and estimate P(T > 9). 

(c) Suppose the engineering school decides to only admit students whose 
weighted totals are above the 85th percentile for the national distribution. 
That is, if η_.85 is the 85th percentile of the distribution of T, a 
student’s weighted total must exceed η_.85 for admission. Use your 
simulated values of T from part (b) to estimate η_.85. [Hint: η_.85 separates 
the bottom 85% of the T distribution from the remaining 15%. What 
value separates the lowest 85% of simulated T values from the rest?] 

147. Refer back to Exercise 145. 

(a) Determine the marginal pdf of X and the conditional pdf of Y given 
X = x. 

(b) Write a program to simulate (X, Y) using the conditional distributions 
method presented in this section. 

(c) What advantage does this method have over the accept-reject approach 
used in Exercise 145? 

148. Consider the situation in Example 4.31: the proportion P of tiles meeting 
thermal specifications varies according to the pdf f(p) = 9p⁸, 0 < p < 1; 
conditional on P = p, the number of inspected tiles that meet specifications 
is a rv Y ~ Bin(20, p). 

(a) Write a program to simulate Y. Your program will first need to simulate a 
value of P, and then generate a variate from the appropriate binomial 
distribution. [Hint: Use your software’s built-in binomial simulation 
tool.] 

(b) Simulate (at least) 10,000 values of Y, and report estimates of both E(Y) 
and Var(Y). How do these compare to the exact answers found in 
Example 4.31? 

(c) Use your simulation to estimate both P(Y = 18) and P(Y > 18). 

149. The conditional distributions method described in this section can also be 
applied to joint discrete rvs. Refer back to the joint pmf presented in this 
section, which is originally from Example 4.1. 

(a) Determine the marginal pmf of X. (This should be very easy.) 

(b) Determine the conditional pmfs of Y given X = 100 and given X = 250. 

(c) Write a program that first simulates X using its marginal pmf, then 
simulates Y via the appropriate conditional pmf. [Hint: For each stage, 
use your program’s built-in discrete simulator (randsample in 
Matlab, sample in R).] 

(d) Use your program in part (c) to simulate at least 10,000 (X, Y) pairs. 
Verify that the relative frequencies of the six possible pairs in your 
sample are close to the probabilities specified in the original joint pmf 
table. 

150. Refer back to Exercise 113, which specifies a bivariate normal distribution 
for the rvs X = height (inches) and Y = weight (lbs) for American males. The 
parameters of that model were μ1 = 70, σ1 = 3, μ2 = 170, σ2 = 20, and ρ = .9. 

(a) Use your software’s built-in multivariate normal simulation function to 
generate (at least) 10,000 (X, Y) pairs according to this bivariate normal 
model. 

(b) A person’s body-mass index (BMI) is determined by the formula 
703Y/X². Use the result of part (a) to create a histogram of BMIs for 
the population of American males. 

(c) BMI scores between 18.5 and 25 are considered healthy. By that crite- 
rion, what proportion of American males are healthy? Report both an 
estimate of this proportion and its estimated standard error. 

151. The conditional distributions method of this section can be implemented to 
simulate a bivariate normal distribution, providing an alternative to built-in 
multivariate simulation tools or Expression (4.10). Let X and Y have a 
bivariate normal distribution with parameters μ1, σ1, μ2, σ2, and ρ. 

(a) What are the marginal distribution of X and the conditional distribution 
of Y given X = x? [Hint: Refer back to Sect. 4.7.] 

(b) Write a program to simulate (X, Y) values from a bivariate normal 
distribution by first simulating X and then Y | X = x. The inputs to your 
program should be the five parameters and the desired number of 
simulated values; the outputs should be vectors containing the simulated 
values of X and Y. 

(c) Use your program to simulate the height-weight distribution from the 
previous exercise. Verify that the sample mean and standard deviation of 
your simulated Y values are roughly 170 and 20, respectively. 

152. Consider the system design illustrated in Exercise 126. Suppose that 
components 1, 2, and 3 have exponential lifetimes with mean 250 h, while 
components 4, 5, and 6 have exponential lifetimes with mean 300 h. 

(a) Write a program to simulate the lifetime of the system. 

(b) Let μ denote the true mean system lifetime. Provide an estimate of μ, 
along with its estimated standard error. 

(c) Let p denote the true probability that the system lasts more than 200 h. 
Provide an estimate of p, along with its estimated standard error. 

153. Consider the system design illustrated in Exercise 127. Suppose the 
odd-numbered components have exponential lifetimes with mean 250 h, 
while the even-numbered components have gamma lifetime distributions 
with α = 2 and β = 125. (This second distribution also has mean 250 h.) 

(a) Write a program to simulate the lifetime of the system. [You might want 
to use your software’s built-in gamma random number generator.] 

(b) Let μ denote the true mean system lifetime. Provide an estimate of μ, 
along with its estimated standard error. 

(c) Let p denote the true probability that the system fails prior to 400 h. 
Provide an estimate of p, along with its estimated standard error. 


Supplementary Exercises (154-192) 


154. Suppose the amount of rainfall in one region during a particular month has an 
exponential distribution with mean value 3 in., the amount of rainfall in a 
second region during that same month has an exponential distribution with 
mean value 2 in., and the two amounts are independent of each other. What is 
the probability that the second region gets more rainfall during this month 
than does the first region? 

155. Two messages are to be sent. The time (min) necessary to send each message 
has an exponential distribution with parameter λ = 1, and the two times are 
independent of each other. It costs $2 per minute to send the first message 
and $1 per minute to send the second. Obtain the density function of the total 
cost of sending the two messages. [Hint: First obtain the cumulative 
distribution function of the total cost, which involves integrating the joint pdf.] 

156. A restaurant serves three fixed-price dinners costing $20, $25, and $30. For a 
randomly selected couple dining at this restaurant, let X = the cost of the 
man’s dinner and Y = the cost of the woman’s dinner. The joint pmf of X and 
Y is given in the following table: 


                     y
p(x, y)   |   20     25     30
x   20    |  .05    .05    .10
    25    |  .05    .10    .35
    30    |   0     .20    .10


(a) Compute the marginal pmfs of X and Y. 

(b) What is the probability that the man’s and the woman’s dinner cost at 
most $25 each? 

(c) Are X and Y independent? Justify your answer. 

(d) What is the expected total cost of the dinner for the two people? 

(e) Suppose that when a couple opens fortune cookies at the conclusion of 
the meal, they find the message “You will receive as a refund the 
difference between the cost of the more expensive and the less expensive 
meal that you have chosen.” How much does the restaurant expect to 
refund? 

157. A health-food store stocks two different brands of a type of grain. Let X = the 
amount (lb) of brand A on hand and Y = the amount of brand B on hand. 
Suppose the joint pdf of X and Y is 

$$f(x, y) = \begin{cases} k & x \ge 0,\ y \ge 0,\ 20 \le x + y \le 30 \\ 0 & \text{otherwise} \end{cases}$$


(a) Draw the region of positive density and determine the value of k. 

(b) Are X and Y independent? Answer by first deriving the marginal pdf of 
each variable. 

(c) Compute P(X + Y < 25). 

(d) What is the expected total amount of this grain on hand? 

(e) Compute Cov(X, Y) and Corr(X, Y). 

(f) What is the variance of the total amount of grain on hand? 

158. Let X1, X2, ..., Xn be random variables denoting n independent bids for an 
item that is for sale. Suppose each Xi is uniformly distributed on the interval 
[100, 200]. If the seller sells to the highest bidder, how much can he expect to 
earn on the sale? [Hint: Let Y = max(X1, X2, ..., Xn). Use the results of 
Sect. 4.9 to find E(Y).] 

159. Suppose a randomly chosen individual’s verbal score X and quantitative 
score Y on a nationally administered aptitude examination have joint pdf 

$$f(x, y) = \begin{cases} \frac{2}{5}(2x + 3y) & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$

You are asked to provide a prediction t of the individual’s total score 
X + Y. The error of prediction is the mean squared error E[(X + Y − t)²]. What 
value of t minimizes the error of prediction? 

160. Let X1 and X2 be quantitative and verbal scores on one aptitude exam, 
and let Y1 and Y2 be the corresponding scores on another exam. If 
Cov(X1, Y1) = 5, Cov(X1, Y2) = 1, Cov(X2, Y1) = 2, and Cov(X2, Y2) = 8, 
what is the covariance between the two total scores X1 + X2 and Y1 + Y2? 

161. Let Z1 and Z2 be independent standard normal rvs and let 

$$W_1 = Z_1, \qquad W_2 = \rho \cdot Z_1 + \sqrt{1 - \rho^2}\, Z_2$$

(a) By definition, W1 has mean 0 and standard deviation 1. Show that the 
same is true for W2. 
(b) Use the properties of covariance to show that Cov(W1, W2) = ρ. 
(c) Show that Corr(W1, W2) = ρ. 
162. You are driving on a highway at speed X1. Cars entering this highway after 
you travel at speeds X2, X3, .... Suppose these Xi’s are independent and 
identically distributed. Unfortunately there is no way for a faster car to 
pass a slower one; it will catch up to the slower one and then travel at the 
same speed. For example, if X1 = 52.3, X2 = 37.5, and X3 = 42.8, then no car 
will catch up to yours, but the third car will catch up to the second. Let 
N = the number of cars that ultimately travel at your speed (in your 
“cohort”), including your own car. Possible values of N are 1, 2, 3, .... 
Show that the pmf of N is p(n) = 1/[n(n + 1)], and then determine the 
expected number of cars in your cohort. [Hint: N = 3 requires that X1 < X2, 
X1 < X3, X4 < X1.] 

163. Suppose the number of children born to an individual has pmf p(x). A 
Galton-Watson branching process unfolds as follows: At time t = 0, the population 
consists of a single individual. Just prior to time t = 1, this individual gives 
birth to X1 individuals according to the pmf p(x), so there are X1 individuals in 
the first generation. Just prior to time t = 2, each of these X1 individuals gives 
birth independently of the others according to the pmf p(x), resulting in X2 
individuals in the second generation (e.g., if X1 = 3, then X2 = Y1 + Y2 + Y3, 
where Yi is the number of progeny of the ith individual in the first generation). 
This process then continues to yield a third generation of size X3, and so on. 

(a) If X1 = 3, Y1 = 4, Y2 = 0, Y3 = 1, draw a tree diagram with two generations 
of branches to represent this situation. 

(b) Let A be the event that the process ultimately becomes extinct (one way 
for A to occur would be to have X1 = 3 with none of these three second- 
generation individuals having any progeny) and let p* = P(A). Argue that 
p* satisfies the equation 

$$p^* = \sum_{x=0}^{\infty} (p^*)^x \cdot p(x)$$

[Hint: A = ∪ (A ∩ {X1 = x}), with the union taken over x = 0, 1, 2, ..., so the 
Law of Total Probability can be applied. Now given that X1 = 3, A will occur if 
and only if each of the three separate branching processes starting from the 
first generation ultimately becomes extinct; what is the probability of this 
happening?] 

(c) Verify that one solution to the equation in (b) is p* = 1. It can be shown 
that this equation has just one other solution, and that the probability 
of ultimate extinction is in fact the smaller of the two roots. If p(0) = .3, 
p(1) = .5, and p(2) = .2, what is p*? Is this consistent with the value of μ, 
the expected number of progeny from a single individual? What happens 
if p(0) = .2, p(1) = .5, and p(2) = .3? 

164. Let f(x) and g(y) be pdfs with corresponding cdfs F(x) and G(y), respectively. 
With c denoting a numerical constant satisfying |c| ≤ 1, consider 

$$f(x, y) = f(x)\,g(y)\,\{1 + c[2F(x) - 1][2G(y) - 1]\}$$


(a) Show that f(x, y) satisfies the conditions necessary to specify a joint pdf for 
two continuous rvs. 

(b) What is the marginal pdf of the first variable X? Of the second variable Y? 

(c) For what values of c are X and Y independent? 

(d) If f(x) and g(y) are normal pdfs, is the joint distribution of X and 
Y bivariate normal? 

165. The joint cumulative distribution function of two random variables X and Y, 
denoted by F(x, y), is defined by 

$$F(x, y) = P[(X \le x) \cap (Y \le y)] \qquad -\infty < x < \infty,\ -\infty < y < \infty$$

(a) Suppose that X and Y are both continuous variables. Once the joint cdf is 
available, explain how it can be used to determine P((X, Y) ∈ A), where 
A is the rectangular region {(x, y): a ≤ x ≤ b, c ≤ y ≤ d}. 

(b) Suppose the only possible values of X and Y are 0, 1, 2, ... and consider 
the values a=5, b= 10, c=2, and d= 6 for the rectangle specified in (a). 
Describe how you would use the joint cdf to calculate the probability that 
the pair (X, Y) falls in the rectangle. More generally, how can the 
rectangular probability be calculated from the joint cdf if a, b, c, and 
d are all integers? 

(c) Determine the joint cdf for the scenario of Example 4.1. [Hint: First 
determine F(x, y) for x= 100, 250 and y=0, 100, and 200. Then describe 
the joint cdf for various other (x, y) pairs.] 

(d) Determine the joint cdf for the scenario of Example 4.3 and use it to 
calculate the probability that X and Y are both between .25 and .75. [Hint: 
For 0 < x < 1 and 0 < y < 1, $F(x, y) = \int_0^x \int_0^y f(u, v)\,dv\,du$.]

(e) Determine the joint cdf for the scenario of Example 4.4. [Hint: Proceed as 
in (d), but be careful about the order of integration and consider separately 
(x, y) points that lie inside the triangular region of positive density and 
then points that lie outside this region.] 

A circular sampling region with radius X is chosen by a biologist, where X has
an exponential distribution with mean value 10 ft. Plants of a certain type 
occur in this region according to a (spatial) Poisson process with “rate” .5 
plant per square foot. Let Y denote the number of plants in the region. 

(a) Find E(Y | X = x) and Var(Y | X = x).

(b) Use part (a) to find E(Y). 

(c) Use part (a) to find Var(Y). 

The number of individuals arriving at a post office to mail packages during a
certain period is a Poisson random variable X with mean value 20. Independently
of each other, any particular customer will mail either 1, 2, 3, or
4 packages with probabilities .4, .3, .2, and .1, respectively. Let Y denote the
total number of packages mailed during this time period.

(a) Find E(Y | X = x) and Var(Y | X = x).

(b) Use part (a) to find E(Y). 

(c) Use part (a) to find Var(Y). 

Sandstone is mined from two different quarries. Let X = the amount mined
(in tons) from the first quarry in one day and Y = the amount mined (in tons)
from the second quarry in one day. The variables X and Y are independent,
with μ_X = 12, σ_X = 4, μ_Y = 10, σ_Y = 3.

(a) Find the mean and standard deviation of the variable X + Y, the total 
amount of sandstone mined in a day. 

(b) Find the mean and standard deviation of the variable X — Y, the difference 
in the mines’ outputs in a day. 

(c) The manager of the first quarry sells sandstone at $25/t, while the manager 
of the second quarry sells sandstone at $28/t. Find the mean and standard 
deviation for the combined amount of money the quarries generate in 
a day. 

(d) Assuming X and Y are both normally distributed, find the probability the 
quarries generate more than $750 revenue in a day. 

The article “Stochastic Modeling for Pavement Warranty Cost Estimation”
(J. of Constr. Engr. and Mgmt., 2009: 352-359) proposes the following
model for the distribution of Y = time to pavement failure. Let X_1 be the
time to failure due to rutting, and X_2 be the time to failure due to transverse
cracking; these two rvs are assumed independent. Then Y = min(X_1, X_2). The
probability of failure due to either one of these distress modes is assumed to be
an increasing function of time t. After making certain distributional
assumptions, the following form of the cdf for each mode is obtained:


$$\Phi\!\left(\frac{a + bt}{\sqrt{c + dt + et^2}}\right)$$


where Φ is the standard normal cdf. Values of the five parameters a, b, c, d,
and e are −25.49, 1.15, 4.45, −1.78, and .171 for cracking and −21.27, .0325,
.972, −.00028, and .00022 for rutting. Determine the probability of pavement
failure within t = 5 years and also t = 10 years.

Consider a sealed-bid auction in which each of the n bidders has his/her
valuation (assessment of inherent worth) of the item being auctioned. The
valuation of any particular bidder is not known to the other bidders. Suppose
these valuations constitute a random sample X_1, ..., X_n, with corresponding
order statistics Y_1 ≤ Y_2 ≤ ⋯ ≤ Y_n. The rent of the winning bidder is the
difference between the winner’s valuation and the price. The article “Mean
Sample Spacings, Sample Size and Variability in an Auction-Theoretic
Framework” (Oper. Res. Lett., 2004: 103-108) argues that the rent is just
Y_n − Y_{n−1} (do you see why?).

(a) Suppose that the valuation distribution is uniform on [0, 100]. What is the 
expected rent when there are n= 10 bidders? 

(b) Referring back to (a), what happens when there are 11 bidders? More 
generally, what is the relationship between the expected rent for n bidders 
and for n+ 1 bidders? Is this intuitive? [Note: The cited article presents a 
counterexample. ] 

Suppose two identical components are connected in parallel, so the system 
continues to function as long as at least one of the components does so. The 
two lifetimes are independent of each other, each having an exponential 
distribution with mean 1000 h. Let W denote system lifetime. Obtain the 
moment generating function of W, and use it to calculate the expected 
lifetime. 

Let Y_0 denote the initial price of a particular security and Y_n denote the price at
the end of n additional weeks for n = 1, 2, 3, .... Assume that the successive
price ratios Y_1/Y_0, Y_2/Y_1, Y_3/Y_2, ... are independent of one another and that
each ratio has a lognormal distribution with μ = .4 and σ = .8 (the assumptions
of independence and lognormality are common in such scenarios).

(a) Calculate the probability that the security price will increase over the 
course of a week. 

(b) Calculate the probability that the security price will be higher at the end of 
the next week, be lower the week after that, and then be higher again at the 
end of the following week. [Hint: What does “higher” say about the ratio
Y_{i+1}/Y_i?]

(c) Calculate the probability that the security price will have increased by at 
least 20% over the course of a five-week period. [Hint: Consider the ratio
Y_5/Y_0, and write this in terms of successive ratios Y_{i+1}/Y_i.]

In cost estimation, the total cost of a project is the sum of component
task costs. Each of these costs is a random variable with a probability
distribution. It is customary to obtain information about the total cost
distribution by adding together characteristics of the individual component
cost distributions; this is called the “roll-up” procedure. For example,
E(X_1 + ⋯ + X_n) = E(X_1) + ⋯ + E(X_n), so the roll-up procedure is valid for
mean cost. Suppose that there are two component tasks and that X_1 and X_2 are
independent, normally distributed random variables. Is the roll-up procedure
valid for the 75th percentile? That is, is the 75th percentile of the distribution
of X_1 + X_2 the same as the sum of the 75th percentiles of the two individual
distributions? If not, what is the relationship between the percentile of the sum
and the sum of percentiles? For what percentiles is the roll-up procedure valid
in this case?

Suppose that for a certain individual, calorie intake at breakfast is a random
variable with expected value 500 and standard deviation 50, calorie intake at
lunch is random with expected value 900 and standard deviation 100, and
calorie intake at dinner is a random variable with expected value 2000 and
standard deviation 180. Assuming that intakes at different meals are independent
of each other, what is the probability that average calorie intake per day
over the next (365-day) year is at most 3500? [Hint: Let X_i, Y_i, and Z_i denote
the three calorie intakes on day i. Then total intake is given by Σ(X_i + Y_i + Z_i).]

The mean weight of luggage checked by a randomly selected tourist-class
passenger flying between two cities on a certain airline is 40 lb, and the
standard deviation is 10 lb. The mean and standard deviation for a business-class
passenger are 30 lb and 6 lb, respectively.

(a) If there are 12 business-class passengers and 50 tourist-class passengers 
on a particular flight, what are the expected value of total luggage weight 
and the standard deviation of total luggage weight? 

(b) If individual luggage weights are independent, normally distributed rvs, 
what is the probability that total luggage weight is at most 2500 lb?

Random sums. If X_1, X_2, ..., X_n are independent rvs, each with the same mean
value μ and variance σ², then we have seen that E(X_1 + X_2 + ⋯ + X_n) = nμ and
Var(X_1 + X_2 + ⋯ + X_n) = nσ². In some applications, the number of X_i's under
consideration is not a fixed number n but instead a rv N. For example, let N be
the number of components of a certain type brought into a repair shop on a
particular day and let X_i represent the repair time for the ith component. Then
the total repair time is T_N = X_1 + X_2 + ⋯ + X_N, the sum of a random number of
rvs.

(a) Suppose that N is independent of the X_i's. Use the Law of Total Expectation
to obtain an expression for E(T_N) in terms of μ and E(N).

(b) Use the Law of Total Variance to obtain an expression for Var(T_N) in
terms of μ, σ², E(N), and Var(N).

(c) Customers submit orders for stock purchases at a certain online site 
according to a Poisson process with a rate of 3 per hour. The amount 
purchased by any particular customer (in thousands of dollars) has an 
exponential distribution with mean 30. What is the expected total amount 
($) purchased during a particular 4-h period, and what is the standard 
deviation of this total amount? 

Suppose the proportion of rural voters in a certain state who favor a particular 

gubernatorial candidate is .45 and the proportion of suburban and urban voters 

favoring the candidate is .60. If a sample of 200 rural voters and 300 urban and 
suburban voters is obtained, what is the approximate probability that at least 

250 of these voters favor this candidate? 

Let μ denote the true pH of a chemical compound. A sequence of
n independent sample pH determinations will be made. Suppose each sample
pH is a random variable with expected value μ and standard deviation .1. How
many determinations are required if we wish the probability that the sample
average is within .02 of the true pH to be at least .95? What theorem justifies
your probability calculation?

The amount of soft drink that Ann consumes on any given day is independent
of consumption on any other day and is normally distributed with μ = 13 oz
and σ = 2. If she currently has two six-packs of 16-oz bottles, what is the
probability that she still has some soft drink left at the end of 2 weeks
(14 days)? Why should we worry about the validity of the independence
assumption here?

A large university has 500 single employees who are covered by its dental 

plan. Suppose the number of claims filed during the next year by such an 

employee is a Poisson rv with mean value 2.3. Assuming that the number of 
claims filed by any such employee is independent of the number filed by any 
other employee, what is the approximate probability that the total number of 

claims filed is at least 1200? 

A student has a class that is supposed to end at 9:00 a.m. and another that is
supposed to begin at 9:10 a.m. Suppose the actual ending time of the 9 a.m.
class is a normally distributed rv X_1 with mean 9:02 and standard deviation
1.5 min and that the starting time of the next class is also a normally
distributed rv X_2 with mean 9:10 and standard deviation 1 min. Suppose
also that the time necessary to get from one classroom to the other is a
normally distributed rv X_3 with mean 6 min and standard deviation 1 min.
What is the probability that the student makes it to the second class before the
lecture starts? (Assume independence of X_1, X_2, and X_3, which is reasonable if
the student pays no attention to the finishing time of the first class.)

This exercise provides an alternative approach to establishing the properties 

of correlation. 

(a) Use the general formula for the variance of a linear combination to write
an expression for Var(aX + Y). Then let a = σ_Y/σ_X, and show that ρ ≥ −1.
[Hint: Variance is always ≥ 0, and Cov(X, Y) = σ_X · σ_Y · ρ.]

(b) By considering Var(aX − Y), conclude that ρ ≤ 1.

(c) Use the fact that Var(W) = 0 only if W is a constant to show that ρ = 1
only if Y = aX + b.

A rock specimen from a particular area is randomly selected and weighed two
different times. Let W denote the actual weight and X_1 and X_2 the two
measured weights. Then X_1 = W + E_1 and X_2 = W + E_2, where E_1 and E_2 are
the two measurement errors. Suppose that the E_i's are independent of each
other and of W and that Var(E_1) = Var(E_2) = σ_E².

(a) Express ρ, the correlation coefficient between the two measured weights
X_1 and X_2, in terms of σ_W², the variance of actual weight, and σ_X², the
variance of measured weight.

(b) Compute ρ when σ_W = 1 kg and σ_E = .01 kg.

Let A denote the percentage of one constituent in a randomly selected rock
specimen, and let B denote the percentage of a second constituent in that same
specimen. Suppose D and E are measurement errors in determining the values
of A and B so that measured values are X = A + D and Y = B + E, respectively.
Assume that measurement errors are independent of each other and of actual
values.

(a) Show that


$$\mathrm{Corr}(X, Y) = \mathrm{Corr}(A, B) \cdot \sqrt{\mathrm{Corr}(X_1, X_2)} \cdot \sqrt{\mathrm{Corr}(Y_1, Y_2)}$$


where X_1 and X_2 are replicate measurements on the value of A, and Y_1 and Y_2
are defined analogously with respect to B. What effect does the presence of
measurement error have on the correlation?

(b) What is the maximum value of Corr(X, Y) when Corr(X_1, X_2) = .8100 and
Corr(Y_1, Y_2) = .9025? Is this disturbing?


Let X_1, ..., X_n be independent rvs with mean values μ_1, ..., μ_n and variances
σ_1², ..., σ_n². Consider a function h(x_1, ..., x_n), and use it to define a new
rv Y = h(X_1, ..., X_n). Under rather general conditions on the h function,
if the σ_i's are all small relative to the corresponding μ_i's, it can be shown that
E(Y) ≈ h(μ_1, ..., μ_n) and


$$\mathrm{Var}(Y) \approx \left(\frac{\partial h}{\partial x_1}\right)^2 \sigma_1^2 + \cdots + \left(\frac{\partial h}{\partial x_n}\right)^2 \sigma_n^2$$

where each partial derivative is evaluated at (x_1, ..., x_n) = (μ_1, ..., μ_n).


Suppose three resistors with resistances X_1, X_2, X_3 are connected in parallel
across a battery with voltage X_4. Then by Ohm’s law, the current is


$$Y = X_4\left(\frac{1}{X_1} + \frac{1}{X_2} + \frac{1}{X_3}\right)$$

Let μ_1 = 10 Ω, σ_1 = 1.0 Ω, μ_2 = 15 Ω, σ_2 = 1.0 Ω, μ_3 = 20 Ω, σ_3 = 1.5 Ω,
μ_4 = 120 V, σ_4 = 4.0 V. Calculate the approximate expected value and standard
deviation of the current (suggested by “Random Samplings,”
CHEMTECH, 1984: 696-697).
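
The first-order approximation described above is mechanical enough to automate. The following is a minimal sketch, not part of the original text: the function name `delta_method`, the step size, and the rectangle-area illustration are all assumptions chosen for demonstration, and the partial derivatives are estimated by central differences rather than computed symbolically.

```python
import math


def delta_method(h, mus, sigmas, step=1e-5):
    """First-order approximations to E[h(X)] and SD[h(X)] for independent X_i.

    h      : function of n numeric arguments
    mus    : list of means mu_1, ..., mu_n
    sigmas : list of standard deviations sigma_1, ..., sigma_n
    """
    mean_approx = h(*mus)  # E(Y) is approximated by h evaluated at the means
    var_approx = 0.0
    for i, (mu_i, sigma_i) in enumerate(zip(mus, sigmas)):
        eps = step * max(1.0, abs(mu_i))
        up = list(mus)
        up[i] = mu_i + eps
        down = list(mus)
        down[i] = mu_i - eps
        # Central-difference estimate of the partial derivative at the means
        partial = (h(*up) - h(*down)) / (2 * eps)
        var_approx += (partial * sigma_i) ** 2
    return mean_approx, math.sqrt(var_approx)


# Hypothetical illustration (not the resistor exercise): area of a rectangle
# whose side lengths are independent rvs with small standard deviations.
def area(length, width):
    return length * width


m, sd = delta_method(area, mus=[10.0, 5.0], sigmas=[0.2, 0.1])
print(m, sd)
```

For the rectangle, the partial derivatives at the means are 5 and 10, so the approximate variance is (5 · 0.2)² + (10 · 0.1)² = 2. The same function can be applied to any h and any set of means and standard deviations.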
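
A more accurate approximation to E[h(X_1, ..., X_n)] in the previous exercise is


$$h(\mu_1, \ldots, \mu_n) + \frac{1}{2}\sigma_1^2\left(\frac{\partial^2 h}{\partial x_1^2}\right) + \cdots + \frac{1}{2}\sigma_n^2\left(\frac{\partial^2 h}{\partial x_n^2}\right)$$


Compute this for Y = h(X_1, X_2, X_3, X_4) given in the previous exercise, and
compare it to the leading term h(μ_1, ..., μ_n).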

Let Y_1 and Y_n be the smallest and largest order statistics, respectively, from a
random sample of size n.

(a) Use the result of Exercise 141 to determine the joint pdf of Y_1 and Y_n.
(Your answer will include the pdf f and cdf F of the original random
sample.)

(b) Let W_1 = Y_1 and W_2 = Y_n − Y_1 (the latter is the sample range). Use the
method of Sect. 4.6 to obtain the joint pdf of W_1 and W_2, and then derive
an expression involving an integral for the pdf of the sample range.

(c) For the case in which the random sample is from a uniform distribution on
[0, 1], carry out the integration of (b) to obtain an explicit formula for the
pdf of the sample range. [Hint: For the Unif[0, 1] distribution, what are
f and F?]

Consider independent and identically distributed random variables X_1, X_2,
X_3, ... where each X_i has a discrete uniform distribution on the integers 0,
1, 2, ..., 9; that is, P(X_i = k) = 1/10 for k = 0, 1, 2, ..., 9. Now form the sum


$$U_n = \sum_{i=1}^{n} \frac{1}{10^i} X_i = .1X_1 + .01X_2 + \cdots + (.1)^n X_n.$$


Intuitively, this is just the first n digits in the decimal expansion of a random
number on the interval [0, 1]. Show that as n → ∞, P(U_n ≤ u) → P(U ≤ u)
where U ~ Unif[0, 1] (this is called convergence in distribution, the type of
convergence involved in the CLT) by showing that the moment generating
function of U_n converges to the moment generating function of U.

[The argument for this appears on p. 52 of the article “A Few Counter 
Examples Useful in Teaching Central Limit Theorems,” The American Stat- 
istician, Feb. 2013.] 

The following example is based on “Conditional Moments and Independence”
(The American Statistician, 2008: 219). Consider the following joint
pdf of two rvs X and Y:


$$f(x, y) = \frac{1}{2\pi}\,\frac{e^{-[(\ln x)^2 + (\ln y)^2]/2}}{xy}\,\left[1 + \sin(2\pi \ln x)\,\sin(2\pi \ln y)\right] \quad \text{for } x > 0,\ y > 0$$


(a) Show that the marginal distribution of each rv is lognormal. [Hint: When
obtaining the marginal pdf of X, make the change of variable u = ln(y).]

(b) Obtain the conditional pdf of Y given that X = x. Then show that for every
positive integer n, E(Y^n | X = x) = E(Y^n). [Hint: Make the change of variable
ln(y) = u + n in the second integrand.]

(c) Redo (b) with X and Y interchanged. 

(d) The results of (b) and (c) suggest intuitively that X and Y are independent 
rvs. Are they in fact independent? 

Let X_1, X_2, ... be a sequence of independent, but not necessarily identically
distributed random variables, and let T = X_1 + ⋯ + X_n. Lyapunov’s Theorem
states that the distribution of the standardized variable (T − μ_T)/σ_T converges
to a N(0, 1) distribution as n → ∞, provided that

$$\lim_{n \to \infty} \frac{\sum_{i=1}^{n} E\left(|X_i - \mu_i|^3\right)}{\sigma_T^3} = 0$$

where μ_i = E(X_i). This limit is sometimes referred to as the Lyapunov condition
for convergence.

(a) Assuming E(X_i) = μ_i and Var(X_i) = σ_i², write expressions for μ_T and σ_T.

(b) Show that the Lyapunov condition is automatically met when the X_i's are
iid. [Hint: Let τ = E(|X_i − μ_i|³), which we assume is finite, and observe that
τ is the same for every X_i. Then simplify the limit.]

(c) Let X_1, X_2, ... be independent random variables, with X_i having an
exponential distribution with mean i. Show that X_1 + ⋯ + X_n has an
approximately normal distribution as n increases.

(d) An online trivia game presents progressively harder questions to players;
specifically, the probability of answering the ith question correctly is 1/i.
Assume any player’s successive answers are independent, and let
T denote the number of questions a player has right out of the first n.
Show that T has an approximately normal distribution for large n.

This exercise and the next complete our investigation of the Coupon
Collector’s Problem begun in the book’s Introduction. A box of a certain
brand of cereal marketed for children is equally likely to contain one of
10 different small toys. Suppose someone purchases boxes of this cereal one
by one, stopping only when all 10 toys have been obtained.

(a) After obtaining a toy in the first box, let Y_2 be the subsequent number of
boxes purchased until a toy different from the one in the first box is
obtained. Argue that this rv has a geometric distribution, and determine
its expected value.

(b) Let Y_3 be the number of additional boxes purchased to get a third type of
toy once two types have been obtained. What kind of a distribution does
this rv have, and what is its expected value?

(c) Analogous to Y_2 and Y_3, define Y_4, ..., Y_10 as the numbers of additional
boxes purchased to get a new type of toy. Express the total number of
boxes purchased in terms of the Y_i's and determine its expected value.

(d) Determine the standard deviation of the total number of boxes purchased.
[Hint: The Y_i's are independent.]
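
An analytic answer to this exercise can be checked against a simulation. The sketch below is an illustration rather than part of the original exercise: the function names, the number of runs, and the seed are arbitrary assumptions, and the output is a Monte Carlo estimate, not an exact value.

```python
import random
import statistics


def boxes_until_complete(num_toys, rng):
    """Buy boxes one at a time until every toy type has been seen; return the count."""
    seen = set()
    boxes = 0
    while len(seen) < num_toys:
        seen.add(rng.randrange(num_toys))  # each toy type is equally likely
        boxes += 1
    return boxes


def simulate(num_toys=10, runs=20000, seed=1):
    """Estimate the mean and standard deviation of the number of boxes purchased."""
    rng = random.Random(seed)  # seeded so that repeated runs agree
    counts = [boxes_until_complete(num_toys, rng) for _ in range(runs)]
    return statistics.mean(counts), statistics.stdev(counts)


avg, sd = simulate()
print(round(avg, 2), round(sd, 2))
```

The two printed numbers should be close to the expected value from part (c) and the standard deviation from part (d), with agreement improving as `runs` grows.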

Return to the scenario described in the previous problem. Suppose an individual
purchases 25 boxes of this cereal.

(a) Let X_1 = 1 if at least one type 1 toy is included in the 25 boxes and X_1 = 0
otherwise (a Bernoulli rv). Determine E(X_1).

(b) Define X_2, ..., X_10 analogously to X_1 for the other nine types of toys.
Express the number of different toys obtained from the 25 boxes in terms
of the X_i's and determine its expected value.

(c) What happens to the expected value in (b) as the number of boxes
purchased increases? As the number of different toys available increases?

(d) Show that, for i ≠ j,

$$\mathrm{Cov}(X_i, X_j) = \left(\frac{8}{10}\right)^{25} - \left(\frac{9}{10}\right)^{50}$$

Then determine the variance of the number of different toys obtained
from the 25 boxes by applying Eq. (4.5) to the expression from part (b).
[Hint: Refer back to Example 4.19 for the required method.]


The Basics of Statistical Inference 


The overarching objective of statistical inference is to draw conclusions (make
inferences) based on available sample data. In this chapter we generally assume that
data have been acquired by observing the values of a random sample X_1, X_2, ..., X_n;
recall from Sect. 4.5 that a random sample consists of rvs that are independent and
have the same underlying probability distribution (what we also called iid). For
example, highway fuel efficiency of a certain type of vehicle might have a normal
distribution with mean μ and standard deviation σ. Then each observed fuel
efficiency value would come from this normal distribution, with the various
observed values obtained independently of one another: a normal random sample.
Or the number of blemishes on a new type of DVD might have a Poisson distribution
with mean value μ. If n of these disks were to be randomly selected and the
number of blemishes on each one counted, the result would be data from a Poisson
random sample. In either example, the values of the parameters would typically not
be known to an investigator. The sample data would then be used to draw some type
of conclusion about these values.
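
To make the two kinds of random sample concrete, here is a small simulation sketch. It is an illustration only: the parameter values (mean 31.0 and standard deviation 2.0 for fuel efficiency, mean 0.8 for blemishes), the sample size 25, and the seed are assumed for the demonstration, whereas in practice an investigator would not know the parameters. The Poisson draws are generated by counting arrivals of a unit-rate Poisson process in the interval [0, μ].

```python
import random
import statistics


def normal_sample(n, mu, sigma, rng):
    """Return n independent draws from a normal distribution."""
    return [rng.gauss(mu, sigma) for _ in range(n)]


def poisson_draw(mu, rng):
    """One Poisson(mu) value: count unit-rate exponential arrivals in [0, mu]."""
    count = 0
    t = rng.expovariate(1.0)
    while t <= mu:
        count += 1
        t += rng.expovariate(1.0)
    return count


def poisson_sample(n, mu, rng):
    """Return n independent draws from a Poisson distribution."""
    return [poisson_draw(mu, rng) for _ in range(n)]


rng = random.Random(2024)  # seeded so that the run is reproducible

# Hypothetical "true" parameter values, assumed only for this illustration
mpg = normal_sample(25, mu=31.0, sigma=2.0, rng=rng)
blemishes = poisson_sample(25, mu=0.8, rng=rng)

print(round(statistics.mean(mpg), 2), round(statistics.mean(blemishes), 2))
```

Rerunning with a different seed gives different sample values and different sample means, which is the sample-to-sample variability that the inferential procedures in this chapter must account for.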

In this chapter we introduce several different inferential procedures. The first, 
point estimation, involves using the available data to obtain a single number that 
can be regarded as an educated guess for the value of some parameter (“point”
refers to the fact that a single number corresponds to a single point on a number 
line). Thus we might offer up 31.2 mpg as a sensible estimate of population mean 
fuel efficiency, or 0.8 as an estimate of the true mean number of blemishes per 
DVD. Section 5.1 introduces some general concepts of point estimation and 
methods for assessing the quality of an estimate, while Sect. 5.2 discusses a popular 
method for producing point estimates. 

A point estimate by itself, being a single number, does not provide any information
as to how close the estimate might be to the value of the parameter being
estimated. This deficiency can be remedied by calculating an entire set of plausible
values for the parameter of interest, called a confidence interval. For example, it
might be reported with a high degree of confidence (more precisely, a confidence
level of 95%) that the true average breaking strength of hockey sticks made from a
certain type of graphite-Kevlar composite is estimated to be between 459.5 and
466.2 N. Later in the chapter we consider confidence intervals for a population
mean and also a population proportion (e.g., the proportion of all college students
who regularly text during class).

Rather than estimating the value of some parameter, we may wish to decide
which of two contradictory claims about the parameter is correct. Suppose, for
example, that 1,000,000 signatures have been submitted in support of putting a
particular initiative on a statewide ballot. State law requires that more than 500,000
of these signatures be valid. If we let p denote the proportion of valid signatures
among those submitted, then the initiative qualifies if p > .5 and does not qualify if
p ≤ .5. Because it is extremely tedious and time consuming to check all one million
signatures, it is customary to select a random sample, determine how many of those
are valid, and then use the result as a basis for deciding between the two contradictory
hypotheses p > .5 and p ≤ .5. In this chapter we shall consider methods for
“testing” hypotheses (that is, deciding which of two hypotheses is more plausible)
about a population mean and also a population proportion.

Thus far our paradigm for inference has been to regard a parameter such as μ as
having a fixed but unknown value. A different perspective, referred to as the 
Bayesian method, views any parameter whose value is unknown as being a random 
variable with some type of “prior” probability distribution. Once sample data is 
available, Bayes’ theorem can be used to obtain the “posterior” distribution of the 
parameter conditional on the observed data. Adherents of the Bayesian method of 
inference then use this posterior distribution to draw some type of conclusion about 
the unknown parameter. The last section of this chapter introduces Bayesian 
methodology. 


5.1 Point Estimation 


Recall that a parameter is a numerical characteristic of a probability distribution. 
Often the distribution under consideration furnishes a model for how some variable 
is distributed in a population of interest. Examples include the distribution of yield 
strength values in a population of building-grade steel bars, or the distribution of 
time-to-recovery from a dental anesthetic in the conceptual population of all 
individuals given the anesthetic (we say “conceptual” here because the population 
includes both past and future recipients of the treatment). One parameter in the 
anesthetic scenario is the population mean recovery time μ, while in the steel bar
population an investigator might focus on the parameter η_.95, the 95th percentile of
the distribution (i.e., the yield strength that separates the strongest 5% of steel bars 
from the other 95%). 

Statistical inference is frequently directed toward drawing some type of 
conclusions about one or more parameters. To do so requires that an investigator 
obtain sample data from the underlying distribution. If the sample consists of 
observations on some random variable X, we will denote the number of sample 
observations (the sample size) by n, the first observation by x_1, the second by x_2, and
so on, with the last observation represented by x_n. The subscripts on x generally
have no relationship to the magnitudes of the observations. They are often listed in
the order in which they were acquired by an investigator. Conclusions (inferences) 
about the population distribution can then be based on the computed values of 
various sample quantities. 


DEFINITION 
A statistic is any random variable whose value can be computed from 
sample data. 


Example 5.1 Zinfandel is a popular red wine varietal produced almost exclusively 
in California. It is rather controversial among wine connoisseurs because its alcohol 
content varies rather substantially from one producer to another. We went to the 
Web site klwines.com, randomly selected 10 from among the 325 available 
zinfandels, and obtained the following values of alcohol content (%): 


x_1 = 14.8   x_2 = 14.5   x_3 = 16.1   x_4 = 14.2   x_5 = 15.9
x_6 = 13.7   x_7 = 16.2   x_8 = 14.6   x_9 = 13.8   x_10 = 15.0


Here are examples of some statistics and their values calculated from the 
foregoing data: 
(a) The sample mean X̄, the arithmetic average of the n observations:


$$\bar{X} = \frac{\sum X_i}{n} = \frac{X_1 + \cdots + X_n}{n}$$


We encountered X̄ previously in Sect. 4.5; this is the most frequently used
measure of center for sample data. The calculated value of the sample mean for
the given data is


$$\bar{x} = \frac{\sum x_i}{n} = \frac{14.8 + 14.5 + \cdots + 15.0}{10} = \frac{148.8}{10} = 14.88$$


Another sample of 10 such wines might yield x̄ = 15.23, and yet another
give x̄ = 14.70. Prior to obtaining the data, there is uncertainty in what the value
of the sample mean will be; hence we think of it as a random variable.

(b) The value of the sample mean can be unduly influenced by even a single 
unusually large or small observation, e.g., a sample of incomes that includes 
Bill Gates, or a sample of cities’ populations that includes Shanghai. An 
alternative measure of center is the sample median X̃: list the n observations
in increasing order from smallest to largest; then if n is an odd number, the
median is the middle value in this ordered list [the (n + 1)/2th value in from
either end], and if n is even, the median is the average of the two middle values.
Clearly several extreme values on either end of the ordered list will have no 
impact on the median. The ordered observations in our sample are 

13.7 13.8 14.2 14.5 14.6 14.8 15.0 15.9 16.1 16.2


Because n = 10, the calculated value of the sample median is the average of
the fifth and sixth largest values: x̃ = (14.6 + 14.8)/2 = 14.70. Again, a
second sample might result in x̃ = 15.15, a third sample in x̃ = 14.95, and so
on. Prior to obtaining data, there is uncertainty in what value of the sample 
median will result, so the sample median is regarded as a random variable. 

(c) The sample mean and sample median are both assessments of where the sample 
is centered—a typical or representative value. Another important characteristic 
of data is the extent to which the observations spread out about the center. The 
simplest measure of spread (dispersion, variability) is the sample range W: the 
difference between the largest and smallest observations. For our data, 
w = 16.2 − 13.7 = 2.5. A second sample might yield 14.1 and 15.8 as the
smallest and largest observations, giving w = 1.7. Clearly the value of the
sample range varies from one sample to another, so before data is available it 
is viewed as a random variable. 

(d) In the context of simulation in Chaps. 2 and 3, we previously introduced 
another measure of variability, the sample standard deviation S: 


$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$$


(The sample variance is defined as S².) For our data, the observed value of the
sample standard deviation is


$$s = \sqrt{\frac{1}{10-1}\sum_{i=1}^{10}(x_i - 14.88)^2} = \sqrt{\frac{1}{9}\left[(14.8 - 14.88)^2 + \cdots + (15.0 - 14.88)^2\right]} = 0.915$$


As with the previous examples of statistics, this value is particular to our 
data; a different random sample of 10 zinfandel wines might provide s = 1.018, 
another s = 0.882, and so on. Thus, prior to obtaining data, we regard S as a 
random variable. 

(e) Finally, consider the random variable


$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$


which expresses the distance between the sample mean and its expected value μ
in standard deviations (e.g., if z = 3, then the value of the sample mean is three
standard deviations larger than would be expected). This rv is not a statistic
unless the values of μ and σ are known; without those values, the sample does not
provide enough information to calculate z.
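
The statistics in parts (a) through (d) of Example 5.1 can be reproduced with a few lines of code. This is a minimal sketch using Python's standard `statistics` module; the variable names and the helper `z_score` are illustrative choices, not notation from the text. The helper takes μ and σ as arguments, which underlines the point in part (e): it cannot be evaluated from the sample alone.

```python
import math
import statistics

# Alcohol content (%) of the 10 sampled zinfandels from Example 5.1
alcohol = [14.8, 14.5, 16.1, 14.2, 15.9, 13.7, 16.2, 14.6, 13.8, 15.0]

n = len(alcohol)
x_bar = statistics.mean(alcohol)      # sample mean
x_tilde = statistics.median(alcohol)  # sample median (average of the two middle values)
w = max(alcohol) - min(alcohol)       # sample range
s = statistics.stdev(alcohol)         # sample standard deviation (divisor n - 1)

print(f"{x_bar:.2f} {x_tilde:.2f} {w:.1f} {s:.3f}")  # 14.88 14.70 2.5 0.915


def z_score(sample_mean, mu, sigma, n):
    """Standardized sample mean; requires mu and sigma, which the data do not supply."""
    return (sample_mean - mu) / (sigma / math.sqrt(n))
```

Note that `statistics.stdev` uses the divisor n − 1, matching the definition of S in part (d), whereas `statistics.pstdev` would divide by n.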


5.1.1 Estimates and Estimators


When discussing general concepts and methods of inference, it is convenient to
have a generic symbol for the parameter of interest. We will use the Greek letter θ
for this purpose. The objective of point estimation is to select a single number,
based on sample data, that represents a sensible value for θ. Suppose, for example,
that the parameter of interest is μ, the true average lifetime of batteries of a certain
type. A random sample of n = 3 batteries might yield observed lifetimes (hours)
x_1 = 5.0, x_2 = 6.4, x_3 = 5.9. The computed value of the sample mean lifetime is
x̄ = 5.77, and it is reasonable to regard 5.77 h as a plausible value of μ, our “best
guess” for the value of μ based on the available sample information.


DEFINITION 

A point estimate of a parameter θ is a single number that can be regarded as a
sensible value for θ. A point estimate is obtained by selecting a suitable
statistic and computing its value from the given sample data. The selected
statistic is called the point estimator of θ.


In the battery scenario just described, the point estimator (i.e., the statistic)
used to obtain the point estimate of μ was X̄, and the point estimate of μ was
5.77. If the three observed lifetimes had instead been x_1 = 5.6, x_2 = 4.5, and
x_3 = 6.1, using the same estimator X̄ would have resulted in a different estimate,
x̄ = (5.6 + 4.5 + 6.1)/3 = 5.40 h.

The symbol θ̂ (“theta hat”) is customarily used to denote the point estimate
resulting from a given sample; we shall also use it to denote the estimator, as using
an uppercase Θ̂ is somewhat awkward to write. Thus μ̂ = X̄ is read as “the point
estimator of μ is the sample mean X̄.” The statement “the point estimate of μ is
5.77 h” can be written concisely as μ̂ = x̄ = 5.77. Notice that in writing a statement
such as θ̂ = 72.5, there is no indication of how this point estimate was obtained
(i.e., what statistic was used). It is recommended that both the estimator and the
resulting estimate be reported.


Example 5.2. The National Health and Nutrition Examination Survey (NHANES) 
collects demographic, socioeconomic, dietary, and health-related information on an 
annual basis. Here is a sample of 20 observations on HDL-cholesterol level (mg/dl) 
obtained from the 2009-2010 survey (HDL is “good” cholesterol, and the higher 
the value, the lower the risk for heart disease): 


35 49 52 54 65 51 51 47 86 36 46 33 39 45 39 63 95 35 30 48 


Figure 5.1 shows both a normal probability plot and a brief descriptive summary 
of the data. 
Fig. 5.1 Normal probability plot and descriptive summary of the HDL sample


(a) Let’s first consider estimating the population mean HDL level $\mu$. The natural
estimator is of course the sample mean $\bar{X}$. The resulting point estimate is

$$\hat{\mu} = \bar{x} = \frac{\sum x_i}{n} = \frac{999}{20} = 49.95 \text{ mg/dl}$$

The NHANES data file contained 7846 HDL observations. We could regard
our sample of size 20 as coming from the population consisting of these 7846
values. The population mean is then known to be $\mu = 52.6$ mg/dl, so our
estimate of 49.95 is somewhat smaller than the value of the parameter we are
trying to estimate. We extracted a second sample of size 20 from the population;
for this sample, $\hat{\mu} = \bar{x} = 57.40$, a substantial overestimate of $\mu$.

(b) Now let’s consider estimating the population median $\eta$, the value that separates
the smallest 50% of all HDL levels in the population from the largest 50%. The
natural statistic for estimating this parameter is the sample median $\tilde{X}$ described
previously. The estimate here is the average of the 10th and 11th values in the
ordered list of sample observations:

$$\hat{\eta} = \tilde{x} = \frac{47 + 48}{2} = 47.5 \text{ mg/dl}$$

This is somewhat smaller than the sample mean because the sample has 
somewhat of a positive skew—values on the upper end stretch out more than do 
values on the lower end, and these pull the mean rightward compared to the 
median. If for the moment we regard the NHANES data set as constituting the 
population, the population median is 51.0 (again somewhat smaller than the 
population mean because of a positive skew). Our estimate of 47.5 is also
smaller than what we are attempting to estimate (51.0). For the second sample 
alluded to in part (a), the sample median was 57.0, an overestimate of the 
population median. 


(c) To estimate the HDL population standard deviation $\sigma$, it is natural to use the
sample standard deviation S as our point estimator. The resulting point estimate
of $\sigma$ is

$$\hat{\sigma} = s = \sqrt{\frac{1}{20 - 1}\left[(35 - 49.95)^2 + \cdots + (48 - 49.95)^2\right]} = 16.81$$

Roughly speaking, the sample SD describes the size of a typical deviation
within the sample from the sample mean. A second sample from the same
population would almost surely give a somewhat different value of s, and thus a
different point estimate of $\sigma$.


(d) An HDL level of at least 60 mg/dl is considered desirable, as it corresponds to a
significantly lower risk of heart disease. How can we estimate the proportion
p of the population having an HDL level of at least 60? If we think of a sample
observation of at least 60 as being a “success,” then a natural estimator of p is
the sample proportion of successes:

$$\hat{P} = \frac{\text{number of successes in the sample}}{n}$$

(We have encountered $\hat{P}$ several times already, both in the context of
simulation and of the Central Limit Theorem.) Four of the 20 sample
observations are at least 60. Thus our point estimate is $\hat{p} = 4/20 = .20$. That
is, we estimate that 20% of the individuals in the population have an HDL level
of at least 60. If a second sample is selected, it may be that 7 of the
20 individuals have such a level. Use of the same estimator then gives the
point estimate 7/20 = .35. Just as with the other estimators proposed in this
example, the value of the estimator $\hat{P}$ will in general vary from one sample to
another. ■
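The four point estimates in Example 5.2 can be recomputed from the 20 listed observations. The sketch below uses Python's standard `statistics` module; the variable names are illustrative and not from the text.

```python
from statistics import mean, median, stdev

# The 20 HDL observations (mg/dl) listed in Example 5.2
hdl = [35, 49, 52, 54, 65, 51, 51, 47, 86, 36,
       46, 33, 39, 45, 39, 63, 95, 35, 30, 48]

n = len(hdl)
mu_hat = mean(hdl)                       # sample mean estimates the population mean
median_hat = median(hdl)                 # sample median estimates the population median
sigma_hat = stdev(hdl)                   # sample SD (divisor n - 1) estimates sigma
p_hat = sum(x >= 60 for x in hdl) / n    # sample proportion with HDL of at least 60

print(mu_hat, median_hat, round(sigma_hat, 2), p_hat)
```

These values should agree with the estimates 49.95, 47.5, 16.81, and .20 reported in parts (a) through (d).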


The foregoing example may have suggested that point estimation is deceptively
straightforward: once the parameter to be estimated is identified, use intuition to
specify a suitable estimator (statistic) and then just calculate. However, there are at
least two major problems with this strategy. The first is that intuition may not be up
to the task of identifying an estimator. For example, suppose a materials engineer is
willing to assume (based on subject matter expertise and an appropriate probability
plot) that the data she collected were sampled from a Weibull distribution. This
distribution has two parameters, $\alpha$ and $\beta$, which appear in the Weibull pdf in a rather
complicated way. Furthermore, the mean $\mu$ and standard deviation $\sigma$ both involve
the gamma function. So the sample mean and sample SD estimate complicated
functions of the two parameters; it is not at all obvious how to sensibly estimate $\alpha$
and $\beta$. In the next section we introduce a constructive method for producing
estimators that will generally be reliable.

The second problem with relying solely on intuition is that, in many situations,
there are two or more estimators for a particular parameter that could sensibly be
used. For example, suppose an investigator is quite convinced (again, by a combination
of subject matter expertise and a probability plot) that the available data were
generated by a normal distribution. A major objective of the investigation is to
estimate the parameter $\mu$. Since $\mu$ is the mean value of the normal population
distribution, it certainly makes sense to use the sample mean $\bar{X}$ as its estimator.
However, because any normal density curve is symmetric, $\mu$ is also the median of
the normal population distribution. It is then sensible to think of using the sample
median $\tilde{X}$ as an estimator. Two other potential estimators of $\mu$ are the midrange, the
average of the largest and smallest observations, and a trimmed mean, obtained by
eliminating a specified percentage of the values from each end of the ordered list
and averaging those values that remain.

As a second example of competing estimators, consider data resulting from a
Poisson random sample. This distribution has one parameter, $\mu$, which is both the
mean and the variance of the Poisson model. So one sensible estimator of $\mu$ is the
sample mean, another is the sample variance, and a third is the average of these two.
The choice between competing estimators such as these cannot usually be based on
intuitive reasoning. Instead we need to introduce desirable properties for an estimator
and then try to find one that satisfies the properties.


5.1.2 Assessing Estimators: Accuracy and Precision 


When a particular statistic is selected to estimate an unknown parameter, two 
criteria often used to assess the quality of that estimator are its accuracy and its 
precision. Loosely speaking, an estimator is accurate if it has no systematic 
tendency, across repeated values of the estimator calculated from different samples, 
to overestimate or underestimate the value of the parameter. An estimator is precise 
if those same repeated values are “close together,” so that two statisticians using the 
same estimator formula (but two different random samples) are liable to get similar 
point estimates. 

The notions of accuracy and precision are made more rigorous by the following 
definitions. 


DEFINITION 

A point estimator $\hat{\theta}$ is said to be an unbiased estimator of $\theta$ if $E(\hat{\theta}) = \theta$ for
every possible value of $\theta$. If $\hat{\theta}$ is not unbiased, the difference $E(\hat{\theta}) - \theta$ is
called the bias of $\hat{\theta}$.

The standard error of $\hat{\theta}$ is its standard deviation, $\sigma_{\hat{\theta}} = SD(\hat{\theta})$. If the
standard error itself involves unknown parameters whose values can be
estimated, substitution of these estimates into $\sigma_{\hat{\theta}}$ yields the estimated
standard error of $\hat{\theta}$. The estimated standard error can be denoted by either
$\hat{\sigma}_{\hat{\theta}}$ or by $s_{\hat{\theta}}$.


The bias of an estimator $\hat{\theta}$ quantifies its accuracy by measuring how far, on the
average, $\hat{\theta}$ differs from $\theta$. The standard error of $\hat{\theta}$ quantifies its precision by
measuring the variability of $\hat{\theta}$ across different possible realizations (i.e., different
random samples). It is important to note that both bias and standard error are
properties of an estimator (the random variable), such as $\bar{X}$, and not of any specific
value or estimate, $\bar{x}$.

Figure 5.2 illustrates bias and standard error for three potential estimators of a
population parameter $\theta$. Figure 5.2a shows the distribution of an estimator $\hat{\theta}_1$ whose
expected value is very close to $\theta$ but whose distribution is quite dispersed. Hence, $\hat{\theta}_1$
has low bias but relatively high standard error. In contrast, the distribution of $\hat{\theta}_2$
displayed in Fig. 5.2b is very concentrated but is “off target”: the values of $\hat{\theta}_2$ across
different random samples will systematically over-estimate $\theta$ by a large amount.
So, $\hat{\theta}_2$ has low standard error but high bias. The “ideal” estimator is illustrated in
Fig. 5.2c: $\hat{\theta}_3$ has a mean roughly equal to $\theta$, so it has low bias, and it also has a
relatively small standard error.


Fig. 5.2 Three potential types of estimators: (a) accurate, but not precise; (b) precise, but not
accurate; (c) both accurate and precise


Example 5.3 Consider the scenario of Example 5.1, wherein a sample mean $\bar{X}$
from a random sample of n = 10 observations will be used to estimate the population
mean alcohol content $\mu$ of all zinfandel wines. In Sect. 4.5, we showed that the
expected value and standard deviation of $\bar{X}$ are $\mu$ and $\sigma/\sqrt{n}$, respectively, where $\sigma$ is
the population standard deviation (i.e., the SD of the alcohol content of all zinfandel
wines). Hence, the bias of $\bar{X}$ in estimating $\mu$ is

$$E(\bar{X}) - \mu = \mu - \mu = 0$$

That is, $\bar{X}$ is an unbiased estimator of $\mu$. This is true for any random sample and
for any sample size, n. The standard error of $\bar{X}$ is simply $SD(\bar{X}) = \sigma/\sqrt{n} = \sigma/\sqrt{10}$;
clearly the precision of $\bar{X}$ would be improved (i.e., the standard error reduced) by
increasing the sample size n.

Since the value of $\sigma$ is almost always unknown, we can estimate the standard
error of $\bar{X}$ by $\hat{\sigma}_{\bar{X}} = s/\sqrt{n}$, where s denotes the sample standard deviation, as we did
in the context of simulations in Sects. 2.8 and 3.8. For the random sample of
10 wines presented in Example 5.1, we have a point estimate $\hat{\mu} = \bar{x} = 14.88$ with
an estimated standard error of $s/\sqrt{n} = 0.915/\sqrt{10} = 0.29$. The latter indicates that,
based on the available data, we believe our estimate of $\mu$ is liable to differ by about
0.29 from the actual value of $\mu$. ■
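The estimated standard error in Example 5.3 is a one-line calculation. In the sketch below, `estimated_se_mean` is a hypothetical helper name; the inputs s = 0.915 and n = 10 are the values quoted in the example.

```python
from math import sqrt

def estimated_se_mean(s, n):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return s / sqrt(n)

se_hat = estimated_se_mean(0.915, 10)
print(round(se_hat, 2))

# Quadrupling the sample size (with the same s) halves the estimated standard error.
print(round(estimated_se_mean(0.915, 40), 3))
```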


Example 5.4 Consider once again estimating a population proportion of
“successes” p (for example, the proportion of all engineering graduates who have
taken a statistics course, or the proportion of all vehicle accidents in which cell
phone use was not a factor). The natural estimator of p is the sample proportion of
successes $\hat{P} = X/n$, where X denotes the number of successes in the sample. Using
the fact that X ~ Bin(n, p), we showed in Sect. 2.4 that the mean and standard error
of $\hat{P}$ are

$$E(\hat{P}) = p \quad \text{and} \quad SD(\hat{P}) = \sqrt{\frac{p(1 - p)}{n}}$$

The first equation tells us that $\hat{P}$ is an unbiased estimator for p, and that this is
true no matter the sample size. As for the standard error, since p is unknown (else
why estimate?), we substitute $\hat{p} = x/n$ into $\sigma_{\hat{P}}$, yielding the estimated standard error
$\hat{\sigma}_{\hat{P}} = \sqrt{\hat{p}(1 - \hat{p})/n}$. This was used several times in the context of simulation in
earlier chapters, and we will see this expression again in Sect. 5.5. When n = 25
and $\hat{p} = .6$, this gives $\hat{\sigma}_{\hat{P}} = \sqrt{(.6)(.4)/25} = .098$. Alternatively, since the largest
value of p(1 − p) is attained when p = .5, an upper bound on the standard error is
$\sqrt{(.5)(.5)/n} = 1/(2\sqrt{n})$. ■
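Both quantities at the end of Example 5.4 are easy to compute. The helper names below are illustrative assumptions, not notation from the text; the inputs n = 25 and $\hat{p} = .6$ come from the example.

```python
from math import sqrt

def estimated_se_proportion(p_hat, n):
    """Estimated standard error of the sample proportion: sqrt(p_hat(1 - p_hat)/n)."""
    return sqrt(p_hat * (1 - p_hat) / n)

def se_upper_bound(n):
    """Largest possible standard error of the sample proportion, attained at p = .5."""
    return 1 / (2 * sqrt(n))

se_hat = estimated_se_proportion(0.6, 25)
bound = se_upper_bound(25)
print(round(se_hat, 3), bound)
```

The bound depends only on n, so it can be computed before any data are collected.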


Example 5.5 The time a customer spends in service after waiting in a queue is
often modeled with an exponential distribution. Recall that the exponential model
has a single parameter, $\lambda$, and that the mean of the exponential distribution is $1/\lambda$.
Thus, since $\lambda = 1/\mu$, a reasonable estimator of $\lambda$ might be

$$\hat{\lambda} = \frac{1}{\bar{X}}$$

where $\bar{X}$ is the average of a random sample of wait times $X_1, \ldots, X_n$ from the
aforementioned single-server queue. How accurate is $\hat{\lambda}$ as an estimator of $\lambda$? How
precise is it?

It can be shown (Exercise 11) that the mean and variance of $\hat{\lambda} = 1/\bar{X}$ are

$$E(\hat{\lambda}) = \frac{n\lambda}{n - 1} \quad \text{and} \quad \mathrm{Var}(\hat{\lambda}) = \frac{n^2\lambda^2}{(n - 1)^2(n - 2)}$$

The bias of $\hat{\lambda}$ as an estimator of $\lambda$ is therefore $E(\hat{\lambda}) - \lambda = \lambda/(n - 1)$. We see that
$\hat{\lambda}$ is not an unbiased estimator of $\lambda$; since $\lambda/(n - 1) > 0$, we say that $\hat{\lambda}$ is biased high,
meaning it will tend to systematically over-estimate $\lambda$. Clearly the bias approaches
0 as n increases.

The standard error of $\hat{\lambda}$ is the square root of the variance expression above. It can
be estimated by replacing the unknown $\lambda$ with the calculated value of $\hat{\lambda}$, $\hat{\lambda} = 1/\bar{x}$,
resulting in

$$\hat{\sigma}_{\hat{\lambda}} = \sqrt{\frac{n^2\hat{\lambda}^2}{(n - 1)^2(n - 2)}} = \frac{n}{(n - 1)\sqrt{n - 2}\;\bar{x}}$$ ■
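The upward bias of $\hat{\lambda} = 1/\bar{X}$ can also be seen by simulation. The sketch below is an illustration under assumed settings ($\lambda = 2$, n = 10, 20,000 replications, a fixed seed), none of which come from the text; it compares the simulated mean of $\hat{\lambda}$ with the value $n\lambda/(n - 1)$ stated in Example 5.5.

```python
import random
from statistics import mean

def lambda_hat(sample):
    """The estimator from Example 5.5: reciprocal of the sample mean."""
    return 1 / mean(sample)

def simulate_lambda_hat(lam, n, reps, seed=1):
    """Return `reps` values of lambda-hat, each from an exponential sample of size n."""
    rng = random.Random(seed)
    return [lambda_hat([rng.expovariate(lam) for _ in range(n)])
            for _ in range(reps)]

lam, n = 2.0, 10                          # assumed true rate and sample size
estimates = simulate_lambda_hat(lam, n, reps=20000)

simulated_mean = mean(estimates)
theoretical_mean = n * lam / (n - 1)      # E(lambda-hat) from Example 5.5
simulated_bias = simulated_mean - lam
theoretical_bias = lam / (n - 1)

print(round(simulated_mean, 3), round(theoretical_mean, 3))
print(round(simulated_bias, 3), round(theoretical_bias, 3))
```

The simulated mean should land near $n\lambda/(n - 1)$ rather than near $\lambda$ itself, consistent with the estimator being biased high.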
As mentioned before, in some situations more than one estimator might be
proposed for the same parameter. It is sometimes the case in such scenarios that 
one estimator is more accurate (lower bias) while the other is more precise (smaller 
standard error). Which consideration should prevail? 


Principle of Unbiased Estimation 
When choosing among several different estimators of $\theta$, select one that is
unbiased. 


According to this principle, the sample mean $\bar{X}$ would be selected as an estimator
of a population mean $\mu$ over any biased estimator (see Example 5.3), and a sample
proportion $\hat{P}$ is preferred over any biased estimator of a true proportion p (Example
5.4). In contrast, the estimator $\hat{\lambda}$ of Example 5.5 is not unbiased; if we can find some
other estimator of $\lambda$ which is unbiased, we would choose this latter estimator over $\hat{\lambda}$.

If two or more estimators of a parameter are unbiased, then naturally one selects
the estimator among them with the smallest standard error. For example, we
previously proposed several different estimators for the mean $\mu$ of a normal
distribution. When the sampled distribution is continuous and symmetric, all four
of the proposed estimators (the sample mean $\bar{X}$, the sample median $\tilde{X}$, the
midrange, and a trimmed mean) are unbiased estimators of $\mu$ (provided $\mu$ is finite).
Using some sophisticated mathematics, it can be shown that when drawing from a
normal distribution, $\bar{X}$ has the smallest standard error not only among these four
estimators but in fact among all unbiased estimators of $\mu$. For this reason, $\bar{X}$ is
referred to as the minimum variance unbiased estimator (MVUE) of $\mu$ when
sampling from a normally distributed population.
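A simulation can make the comparison among the four estimators concrete. The following sketch is illustrative only: the settings ($\mu = 0$, $\sigma = 1$, n = 20, 10,000 replications, 10% trimming from each end, a fixed seed) and all function names are assumptions, not taken from the text. It approximates each estimator's standard error by the standard deviation of its values across simulated normal samples.

```python
import random
from statistics import mean, median, stdev

def midrange(sample):
    """Average of the smallest and largest observations."""
    return (min(sample) + max(sample)) / 2

def trimmed_mean(sample, proportion=0.10):
    """Mean after dropping the given proportion of values from each end."""
    k = int(proportion * len(sample))
    ordered = sorted(sample)
    return mean(ordered[k:len(ordered) - k])

def simulated_standard_errors(mu=0.0, sigma=1.0, n=20, reps=10000, seed=1):
    """Approximate the standard error of each estimator from normal samples."""
    rng = random.Random(seed)
    estimators = {"mean": mean, "median": median,
                  "midrange": midrange, "trimmed mean": trimmed_mean}
    values = {name: [] for name in estimators}
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        for name, estimator in estimators.items():
            values[name].append(estimator(sample))
    return {name: stdev(v) for name, v in values.items()}

standard_errors = simulated_standard_errors()
for name, se in sorted(standard_errors.items(), key=lambda item: item[1]):
    print(f"{name:>12}: {se:.4f}")
```

Under these settings the sample mean should show the smallest simulated standard error, close to $\sigma/\sqrt{n}$, with the trimmed mean only slightly behind. A simulation of this kind illustrates the MVUE claim for these four estimators but does not prove it.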

An alternative approach to the Principle of Unbiased Estimation is to combine 
the considerations of accuracy (bias) and precision (standard error) into a single 
measure, which can be achieved through the mean squared error; see Exercise 22. 
Under this method, the estimator with the smallest mean squared error is selected, 
even if it is biased and other estimators are not. 


5.1.3 Exercises: Section 5.1 (1-23) 


1. A study of children’s intelligence and behavior included the following IQ data 
for 33 first-graders that participated in the study. 


82 96 99 102 103 103 106 107 108 108 108 
108 109 110 110 111 113 113 113 113 115 115 
118 118 119 121 122 122 127 132 136 140 146 


(a) Calculate a point estimate of the mean IQ for the conceptual population of 
all first graders in this school, and state which estimator you used. 

(b) Calculate a point estimate of the IQ value that separates the lowest 50% of 
all such students from the highest 50%, and state which estimator 
you used. 

(c) Calculate and interpret a point estimate of the population standard deviation
$\sigma$. Which estimator did you use?

(d) Calculate a point estimate of the proportion of all such students whose IQ 
exceeds 100. [Hint: Think of an observation as a “success” if it exceeds 
100.] 

(e) Calculate a point estimate of the population coefficient of variation,
$100\sigma/\mu$, and state what estimator you used.

2. A sample of 20 students who had recently taken elementary statistics yielded
the following information on brand of calculator owned (T = Texas
Instruments, H = Hewlett-Packard, C = Casio, S = Sharp):


T T H T C T T S C H
S S T H C T T T H T


(a) Estimate the true proportion of all such students who own a Texas 
Instruments calculator. 

(b) Of the 10 students who owned a TI calculator, 4 had graphing calculators. 
Estimate the proportion of students who do not own a TI graphing 
calculator. 


3. Consider the following sample of observations on coating thickness for
low-viscosity paint (“Achieving a Target Value for a Manufacturing Process:
A Case Study,” J. Qual. Technol., 1992: 22-26):


.83 .88 .88 1.04 1.09 1.12 1.29 1.31
1.48 1.49 1.59 1.62 1.65 1.71 1.76 1.83


Assume that the distribution of coating thickness is normal (a normal probability
plot strongly supports this assumption).

(a) Calculate a point estimate of the mean value of coating thickness, and 
state which estimator you used. 

(b) Calculate a point estimate of the median of the coating thickness distribu- 
tion, and state which estimator you used. 

(c) Calculate a point estimate of the value that separates the largest 10% of all
values in the thickness distribution from the remaining 90%, and state
which estimator you used. [Hint: Express what you are trying to estimate
in terms of $\mu$ and $\sigma$.]

(d) Estimate P(X < 1.5), i.e., the proportion of all thickness values less than
1.5. [Hint: If you knew the values of $\mu$ and $\sigma$, you could calculate this
probability. These values are not available, but they can be estimated.]

(e) What is the estimated standard error of the estimator that you used in (b)? 


4. The data set mentioned in Exercise 1 also includes these third-grade IQ
observations for males:


117 103 121 112 120 132 113 117 132 
149 125 131 136 107 108 113 136 114 


and females: 


114 102 113 131 124 117 120 90 
114 109 102 114 127 127 103 

Prior to obtaining data, denote the male values by $X_1, \ldots, X_m$ and the female
values by $Y_1, \ldots, Y_n$. Suppose that the $X_i$s constitute a random sample from a
distribution with mean $\mu_1$ and standard deviation $\sigma_1$ and that the $Y_i$s form a
random sample (independent of the $X_i$s) from another distribution with mean
$\mu_2$ and standard deviation $\sigma_2$.

(a) Show that $\bar{X} - \bar{Y}$ is an unbiased estimator of $\mu_1 - \mu_2$. Then calculate the
estimate for the given data.

(b) Use rules of variance from Chap. 4 to obtain an expression for the
standard error of the estimator in (a), and then compute the estimated
standard error.

(c) Calculate a point estimate of the ratio $\sigma_1/\sigma_2$ of the two standard deviations.

(d) Suppose one male third-grader and one female third-grader are randomly
selected. Calculate a point estimate of the variance of the difference X − Y
between their IQs.


5. As an example of a situation in which several different statistics could
reasonably be used to calculate a point estimate, consider a population of
N invoices. Associated with each invoice is its “book value,” the recorded
amount of that invoice. Let $\tau$ denote the total book value, a known amount.
Some of these book values are erroneous. An audit will be carried out by
randomly selecting n invoices and determining the audited (correct) value for
each one. Suppose that the sample gives the following results (in dollars).


Invoice           1     2     3     4     5
Book value      300   720   526   200   127
Audited value   300   520   526   200   157
Error             0   200     0     0   −30


Let $\bar{X}$ = the sample mean book value, $\bar{Y}$ = the sample mean audited value,
and $\bar{D}$ = the sample mean error. Propose three different statistics for
estimating the total audited (i.e., correct) value $\theta$: one involving just N and
$\bar{X}$, another involving N, $\tau$, and $\bar{D}$, and the last involving $\tau$ and $\bar{X}/\bar{Y}$. Then
calculate the resulting estimates when N = 5000 and $\tau$ = 1,761,300. (The
article “Statistical Models and Analysis in Auditing,” Statistical Science,
1989: 2-33 discusses properties of these estimators.)


6. Consider the accompanying observations on stream flow (thousands of
acre-feet) recorded at a station in Colorado for the period April 1–August 31 over a
31-year span (from an article in the 1974 volume of Water Resources Res.).


127.96 210.07 203.24 108.91 178.21 
285.37 100.85 89.59 185.36 126.94 
200.19 66.24 247.11 299.87 109.64 
125.86 114.79 109.11 330.33 85.54 
117.64 302.74 280.55 145.11 95.36 
204.91 311.13 150.58 262.09 477.08 


94.33 
An appropriate probability plot supports the use of the lognormal distribution
(see Sect. 3.5) as a reasonable model for stream flow.

(a) Estimate the parameters of the distribution. [Hint: Remember that X has a
lognormal distribution with parameters $\mu$ and $\sigma$ if ln(X) is normally
distributed with mean $\mu$ and standard deviation $\sigma$.]

(b) Use the estimates of part (a) to calculate an estimate of the expected value 
of stream flow. [Hint: What is the expression for E(X)?] 


7. (a) A random sample of 10 houses in a particular area, each of which is heated
with natural gas, is selected and the amount of gas (therms) used during
the month of January is determined for each house. The resulting
observations are 103, 156, 118, 89, 125, 147, 122, 109, 138, 99. Let $\mu$
denote the average gas usage during January by all houses in this area.
Compute a point estimate of $\mu$.

(b) Suppose there are 10,000 houses in this area that use natural gas for
heating. Let $\tau$ denote the total amount of gas used by all of these houses
during January. Estimate $\tau$ using the data of (a). What estimator did you
use in computing your estimate?

(c) Use the data in (a) to estimate p, the proportion of all houses that used at 
least 100 therms. 

(d) Give a point estimate of the population median usage based on the sample 
of (a). What estimator did you use? 


8. In a random sample of 80 components of a certain type, 12 are found to be
defective.

(a) Give a point estimate of the proportion of all such components that are not 
defective. 

(b) A system is to be constructed by randomly selecting two of these 
components and connecting them in series, as shown here. 


[Diagram: two components connected in series]


The series connection implies that the system will function if and only if 
neither component is defective (i.e., both components work properly). 
Estimate the proportion of all such systems that work properly. [Hint: 
If p denotes the probability that a component works properly, how can 
P(system works) be expressed in terms of p?] 

(c) Let $\hat{P}$ be the sample proportion of successes. Is $\hat{P}^2$ an unbiased estimator
for $p^2$? [Hint: For any rv Y, $E(Y^2) = \mathrm{Var}(Y) + [E(Y)]^2$.]


9. Each of 150 newly manufactured items is examined and the number of
scratches per item is recorded (the items are supposed to be free of scratches),
yielding the following data:


Number of scratches per item    0    1    2    3    4    5    6    7
Observed frequency             18   37   42   30   13    7    2    1


Let X = the number of scratches on a randomly chosen item, and assume that
X has a Poisson distribution with parameter $\mu$.

(a) Find an unbiased estimator of $\mu$ and compute the estimate for the data.

(b) What is the standard error of your estimator? Compute the estimated
standard error. [Hint: $\sigma_X = \sqrt{\mu}$ for X Poisson.]

10. Let $X_1, \ldots, X_n$ be a random sample from a distribution with mean $\mu$ and
variance $\sigma^2$.

(a) Show that $\sum (X_i - \bar{X})^2 = \left(\sum X_i^2\right) - n\bar{X}^2$.

(b) Show that $E\left(\sum X_i^2\right) = n(\mu^2 + \sigma^2)$. [Hint: Use linearity of expectation,
along with the relation $E(Y^2) = \mathrm{Var}(Y) + [E(Y)]^2$.]

(c) Show that $E(n\bar{X}^2) = n\mu^2 + \sigma^2$. [Hint: Apply the relation given in the
previous hint, but this time to $Y = \bar{X}$.]

(d) Combine parts (a)–(c) to show that $S^2$ is an unbiased estimator of $\sigma^2$.

(e) Does it follow that the sample standard deviation, S, of a random sample is
an unbiased estimator of $\sigma$? Why or why not?


11. Example 5.5 considered the estimator $\hat{\lambda} = 1/\bar{X}$ for the unknown parameter $\lambda$
of an exponential distribution, based on a random sample $X_1, X_2, \ldots, X_n$ from
that distribution.

(a) Show using a moment generating function argument that $\bar{X}$ has a gamma
distribution, with parameters $\alpha = n$ and $\beta = 1/(n\lambda)$.

(b) Find the expected value of $\hat{\lambda}$. [Hint: The goal is to find E(1/Y), where
Y ~ gamma(n, $1/(n\lambda)$). Use the gamma pdf to determine this expected
value.]

(c) Find the variance of $\hat{\lambda}$. [Hint: Now find $E(1/Y^2)$. Then apply the variance
shortcut formula.]

12. Using a long rod that has length $\mu$, you are going to lay out a square plot in
which the length of each side is $\mu$. Thus the area of the plot will be $\mu^2$.
However, you do not know the value of $\mu$, so you decide to make
n independent measurements $X_1, X_2, \ldots, X_n$ of the length. Assume that each
$X_i$ has mean $\mu$ (unbiased measurements) and variance $\sigma^2$.

(a) Show that $\bar{X}^2$ is not an unbiased estimator for $\mu^2$. [Hint: Apply the hint
from Exercises 8 and 10 with $Y = \bar{X}$.]

(b) For what value of k is the estimator $\bar{X}^2 - kS^2$ unbiased for $\mu^2$? [Hint:
Compute $E(\bar{X}^2 - kS^2)$, using the result of Exercise 10(d).]


13. Of $n_1$ randomly selected male smokers, $X_1$ smoked filter cigarettes, whereas of
$n_2$ randomly selected female smokers, $X_2$ smoked filter cigarettes. Let $p_1$ and
$p_2$ denote the probabilities that a randomly selected male and female,
respectively, smoke filter cigarettes.

(a) Show that $(X_1/n_1) - (X_2/n_2)$ is an unbiased estimator for $p_1 - p_2$. [Hint:
What type of rvs are $X_1$ and $X_2$?]

(b) What is the standard error of the estimator in (a)?

(c) How would you use the observed values $x_1$ and $x_2$ to estimate the standard
error of your estimator?

(d) If $n_1 = n_2 = 200$, $x_1 = 127$, and $x_2 = 176$, use the estimator of (a) to obtain
an estimate of $p_1 - p_2$.

(e) Use the result of (c) and the data of (d) to estimate the standard error of the
estimator.

14. Suppose a certain type of fertilizer has an expected yield per acre of $\mu_1$ with
variance $\sigma^2$, whereas the expected yield for a second type of fertilizer is
$\mu_2$ with the same variance $\sigma^2$. Let $S_1^2$ and $S_2^2$ denote the sample variances of
yields based on sample sizes $n_1$ and $n_2$, respectively, of the two fertilizers. Use
the result of Exercise 10(d) to show that the pooled (combined) estimator

$$\hat{\sigma}^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$

is an unbiased estimator of $\sigma^2$.

15. Consider a random sample $X_1, \ldots, X_n$ from the pdf

$$f(x; \theta) = .5(1 + \theta x) \qquad -1 \le x \le 1$$

where $-1 \le \theta \le 1$ (this distribution arises in particle physics). Show that $\hat{\theta} = 3\bar{X}$
is an unbiased estimator of $\theta$. [Hint: First determine $\mu = E(X) = E(\bar{X})$.]

16. A sample of n captured jet fighters results in serial numbers $x_1, x_2, x_3, \ldots, x_n$.
The CIA knows that the aircraft were numbered consecutively at the factory
starting with $\alpha$ and ending with $\beta$, so that the total number of planes
manufactured is $\beta - \alpha + 1$ (e.g., if $\alpha = 17$ and $\beta = 29$, then 29 − 17 + 1 = 13
planes having serial numbers 17, 18, 19, ..., 28, 29 were manufactured).
However, the CIA does not know the values of $\alpha$ or $\beta$. A CIA statistician
suggests using the estimator $\max(X_i) - \min(X_i) + 1$ to estimate the total number
of planes manufactured.

(a) If n = 5, $x_1 = 237$, $x_2 = 375$, $x_3 = 202$, $x_4 = 525$, and $x_5 = 418$, what is the
corresponding estimate?

(b) Under what conditions on the sample will the value of the estimate be
exactly equal to the true total number of planes? Will the estimate ever be
larger than the true total? Do you think the estimator is unbiased for
estimating $\beta - \alpha + 1$? Explain in one or two sentences.

(A similar method was used to estimate German tank production in World
War II.)

17. Let $X_1, X_2, \ldots, X_n$ represent a random sample from a Rayleigh distribution
with pdf

$$f(x; \theta) = \frac{x}{\theta}\, e^{-x^2/(2\theta)} \qquad x > 0$$

(a) It can be shown that $E(X^2) = 2\theta$. Use this fact to construct an unbiased
estimator of $\theta$ based on $\sum X_i^2$ (and use rules of expected value to show
that it is unbiased).

(b) Estimate $\theta$ from the following measurements of blood plasma beta
concentration (in pmol/L) for n = 10 men.


16.88 10.23 4.59 6.66 13.68 
14.23 19.87 9.40 6.51 10.95 


18. Suppose the true average growth $\mu$ of one type of plant during a 1-year period
is identical to that of a second type, but the variance of growth for the first type
is $\sigma^2$, whereas for the second type the variance is $4\sigma^2$. Let $X_1, \ldots, X_m$
be m independent growth observations on the first type [so $E(X_i) = \mu$,
$\mathrm{Var}(X_i) = \sigma^2$], and let $Y_1, \ldots, Y_n$ be n independent growth observations on
the second type [$E(Y_i) = \mu$, $\mathrm{Var}(Y_i) = 4\sigma^2$].

Let c be a numerical constant and consider the estimator
$\hat{\mu} = c\bar{X} + (1 - c)\bar{Y}$. For any c between 0 and 1, this is a weighted average
of the two sample means, e.g., $.7\bar{X} + .3\bar{Y}$.

(a) Show that for any c the estimator is unbiased.

(b) For fixed m and n, what value c minimizes the standard error of $\hat{\mu}$? [Hint:
The estimator is a linear combination of the two sample means and these
means are independent. Once you have an expression for the variance,
differentiate with respect to c.]

19. In Chap. 2, we defined a negative binomial rv as the number of trials required
to achieve the rth success in a sequence of independent and identical
success/failure trials. The probability mass function (pmf) of X is

$$nb(x; r, p) = \begin{cases} \dbinom{x - 1}{r - 1} p^r (1 - p)^{x - r} & x = r, r + 1, r + 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$

(a) Suppose that $r \ge 2$. Show that

$$\hat{P} = (r - 1)/(X - 1)$$

is an unbiased estimator for p. [Hint: Write out $E(\hat{P})$ as a sum, then make
the substitutions y = x − 1 and s = r − 1.]

(b) A reporter wishing to interview five individuals who support a certain
candidate begins asking people whether (S) or not (F) they support the
candidate. If the sequence of responses is SFFSFFFSSS, estimate p = the
true proportion who support the candidate.


20. Suppose that X, the reaction time to a stimulus, has a uniform distribution on
the interval from 0 to an unknown upper limit $\theta$. An investigator wants to
estimate $\theta$ on the basis of a random sample $X_1, X_2, \ldots, X_n$ of reaction times.
Consider two possible estimators:

$$\hat{\theta}_1 = \max(X_1, \ldots, X_n) \qquad \hat{\theta}_2 = 2\bar{X}$$

(a) The following observed reaction times, in seconds, are for a sample of
n = 5 subjects: $x_1 = 4.2$, $x_2 = 1.7$, $x_3 = 2.4$, $x_4 = 3.9$, $x_5 = 1.3$. Calculate a
point estimate of $\theta$ based on $\hat{\theta}_1$ and a point estimate of $\theta$ based on $\hat{\theta}_2$.

(b) The techniques of Sect. 4.9 imply that the pdf of $\hat{\theta}_1$ is $f(y) = ny^{n-1}/\theta^n$
for $0 \le y \le \theta$ (we’re using y as the argument instead of $\hat{\theta}_1$ so that the
notation is less confusing). Use this to obtain the mean and variance of $\hat{\theta}_1$.

(c) Is $\hat{\theta}_1$ an unbiased estimator of $\theta$? Explain why this is reasonable. [Hint: If
the population maximum is $\theta$, what must be true of the sample
maximum?]

(d) The mean and variance of a uniform distribution on [0, $\theta$] are $\theta/2$ and
$\theta^2/12$, respectively. Use these and the properties of $\bar{X}$ to find the mean and
variance of $\hat{\theta}_2$.

(e) If a statistician elected to apply the Principle of Unbiased Estimation,
which estimator would she select? Why?

(f) Find a constant k such that $\hat{\theta}_3 = k \cdot \hat{\theta}_1$ is unbiased for $\theta$, and compare the
standard error of $\hat{\theta}_3$ to the standard error of $\hat{\theta}_2$.

21. An investigator wishes to estimate the proportion of students at a certain
university who have violated the honor code. Having obtained a random
sample of n students, she realizes that asking each, “Have you violated the
honor code?” will probably result in some untruthful responses. Consider the
following scheme, called a randomized response technique. The investigator
makes up a deck of 100 cards, of which 50 are of Type I and 50 are of Type II.

Type I: Have you violated the honor code (yes or no)?

Type II: Is the last digit of your telephone number a 0, 1, or 2 (yes or no)?

Each student in the random sample is asked to mix the deck, draw a card, and
answer the resulting question truthfully. Because of the irrelevant question on
Type II cards, a yes response no longer stigmatizes the respondent, so we
assume that responses are truthful. Let p denote the proportion of honor-code
violators (i.e., the probability of a randomly selected student being a violator),
and let $\lambda$ = P(yes response). Then $\lambda$ and p are related by $\lambda = .5p + (.5)(.3)$.

(a) Let Y denote the number of yes responses, so Y ~ Bin(n, $\lambda$). Thus Y/n is an
unbiased estimator of $\lambda$. Derive an estimator for p based on Y. If n = 80
and y = 20, what is your estimate? [Hint: Solve $\lambda = .5p + .15$ for p and then
substitute Y/n for $\lambda$.]

(b) Use the fact that $E(Y/n) = \lambda$ to show that your estimator is unbiased for p.

(c) If there were 70 Type I and 30 Type II cards, what would be your
estimator for p?


22. The mean squared error of an estimator $\hat{\theta}$ is defined by

$$\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$$

(a) Show that $\text{MSE}(\hat{\theta}) = [E(\hat{\theta}) - \theta]^2 + \text{Var}(\hat{\theta})$ by expanding out the
quadratic expression “inside” the expected value operation in the definition
of MSE and then using linearity of expectation.

(b) If $\hat{\theta}$ is an unbiased estimator of the parameter $\theta$, how does $\text{MSE}(\hat{\theta})$ simplify?

(c) Refer back to Example 5.5. Determine the mean squared error of the
estimator $\hat{\lambda}$ using the mean and variance expressions provided in that
example.

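(d) Consider an alternative estimator, $\hat{\lambda}_a$, defined by

$$\hat{\lambda}_a = \frac{n-1}{\sum X_i} = \frac{n-1}{n} \cdot \frac{1}{\bar{X}} = \frac{n-1}{n} \cdot \hat{\lambda}$$

Obtain the mean, variance, and MSE of $\hat{\lambda}_a$. [Hint: Use rescaling
properties.]

(e) Which of the two estimators, $\hat{\lambda}$ or $\hat{\lambda}_a$, is preferable? Explain your
reasoning.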

23. Return to the problem of estimating a population proportion p, and consider
the following adjusted estimator:

$$\tilde{P} = \frac{X + \sqrt{n/4}}{n + \sqrt{n}}$$

The justification for this estimator comes from the Bayesian approach to
point estimation to be introduced in Sect. 5.6.

(a) Determine the mean, variance, and mean squared error of this estimator.
What do you find interesting about this MSE?

(b) Compare the MSE of this estimator to the MSE of the usual estimator $\hat{P}$
(the sample proportion).


5.2 Maximum Likelihood Estimation


The point estimators introduced in Sect. 5.1 were obtained via intuition and/or 
educated guesswork. We now introduce a “constructive” method for obtaining 
point estimators: the method of maximum likelihood. By constructive we mean 
that the general definition of a maximum likelihood estimator suggests explicitly 
how to obtain the estimator in any specific problem. 

The method of maximum likelihood was first introduced by R. A. Fisher, a 
geneticist and statistician, in the 1920s. Most statisticians recommend this method, 
at least when the sample size is large, since the resulting estimators have certain 
desirable efficiency properties (see the proposition on large-sample behavior 
toward the end of this section). 


Example 5.6 The best protection against hacking into an online account is to use a 
password that has at least eight characters containing upper and lower case letters, a 
numeral, and a special character. Suppose that ten individuals who have a certain 
type of email account are selected, and it is found that the first, third, and tenth 
individuals have such strong protection, whereas the others do not (the January 
2012 issue of Consumer Reports reported that only 25% of individuals surveyed 
used a strong password). Let p = P(strong protection), i.e., p is the proportion of all 
account holders having strong protection. Define Bernoulli random variables $X_1$,
$X_2, \ldots, X_{10}$ by

$$X_i = \begin{cases} 1 & \text{if the } i\text{th person has strong protection} \\ 0 & \text{if not} \end{cases} \qquad i = 1, 2, \ldots, 10$$

Then for the obtained sample, $X_1 = X_3 = X_{10} = 1$ and the other seven $X_i$s are all
zero. The probability mass function of any particular $X_i$ is $p^{x_i}(1-p)^{1-x_i}$, which
becomes p if $x_i = 1$ and $1 - p$ when $x_i = 0$. Finally, the strengths of various
passwords are presumably independent of one another, so that the $X_i$s are independent
and their joint probability mass function is the product of the individual pmfs.
Thus the joint pmf evaluated at the observed $X_i$s is

$$p \cdot (1-p) \cdot p \cdot (1-p) \cdot (1-p) \cdots p = p^3(1-p)^7 \qquad (5.1)$$


Suppose that p = .25. Then the probability of observing the sample that we
actually obtained is $(.25)^3(.75)^7 = .002086$. If instead p = .50, then this probability
is $(.50)^3(.50)^7 = .000977$. For what value of p is the obtained sample most likely to
have occurred? That is, what value of p maximizes the joint pmf (Eq. 5.1)?
Figure 5.3 shows a graph of the likelihood (Eq. 5.1) as a function of p. It appears
that the graph reaches its peak above p = .3, which is the proportion of strong
passwords in the sample. The second figure shows a graph of the natural logarithm
of (Eq. 5.1); since ln[g(u)] is a strictly increasing function of g(u), finding u to
maximize the function g(u) is the same as finding u to maximize ln[g(u)].

Fig. 5.3 Likelihood and log likelihood plotted against p

We can verify our visual impression by using calculus to find the value of p that 
maximizes (Eq. 5.1). Working with the natural log of the joint pmf is often easier 
than working with the joint pmf itself, since the joint pmf is typically a product so 
its logarithm will be a sum. Here

$$\ln[p^3(1-p)^7] = 3\ln(p) + 7\ln(1-p)$$

Thus

$$\frac{d}{dp}\ln[p^3(1-p)^7] = \frac{d}{dp}[3\ln(p) + 7\ln(1-p)] = \frac{3}{p} + \frac{7}{1-p}(-1) = \frac{3}{p} - \frac{7}{1-p}$$

The (−1) comes from the chain rule in calculus. Equating this derivative to 0 and
solving for p gives $3(1-p) = 7p$, from which $3 = 10p$ and so $p = 3/10 = .30$ as
conjectured. That is, our point estimate is $\hat{p} = .30$. It is called the maximum
likelihood estimate because it is the parameter value that maximizes the likelihood 
(joint pmf) of the observed sample. In general, the second derivative should be 
examined to make sure a maximum has been obtained, but here this is obvious from 
the figure. 

Suppose that rather than being told the condition of every password, we had only 
been informed that three of the ten were strong. Then we would have the observed 
value of a binomial random variable X = the number of strong passwords. The pmf
of X is $\binom{10}{x} p^x (1-p)^{10-x}$. For x = 3, this becomes $\binom{10}{3} p^3 (1-p)^7$. The binomial
coefficient $\binom{10}{3}$ is irrelevant to the maximization, so the value of p that maximizes
the likelihood of observing X = 3 is again $\hat{p} = .30$. ■

DEFINITION

Let $X_1, \ldots, X_n$ have a joint distribution (i.e., a joint pmf or pdf) that depends
on a parameter $\theta$ whose value is unknown. Suppose that we observe $X_1 = x_1$,
$X_2 = x_2, \ldots, X_n = x_n$. Substitute the observed data into the joint distribution
and regard it as a function of $\theta$, called the likelihood function and denoted by
$L(\theta)$. Then the maximum likelihood estimate $\hat{\theta}$ is the value of $\theta$ that
maximizes the likelihood, so that $L(\hat{\theta}) \ge L(\theta)$ for every possible value of
$\theta$. Replacing the $x_i$s in $\hat{\theta}$ by $X_i$s gives the maximum likelihood estimator
(mle) of $\theta$.


In Example 5.6, the joint pmf of $X_1, \ldots, X_{10}$ became $p^3(1-p)^7$ once the
observed values of the $X_i$s were substituted. So, the likelihood function would be
written $L(p) = p^3(1-p)^7$. If we take the perspective that our data consists of a
single binomial observation, then $L(p) = \binom{10}{3} p^3 (1-p)^7$. In either case, the value
of p that maximizes L(p) is $\hat{p} = .3$.
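
The maximization can also be checked numerically. The following Python sketch is not part of the original example; the grid of candidate values and the function names are our own illustrative choices. It evaluates the likelihood of Example 5.6, its logarithm, and the binomial version on a grid of p values and reports where each is largest.

```python
from math import comb, log

def likelihood(p):
    """Likelihood L(p) = p^3 (1 - p)^7 for the observed sample in Example 5.6."""
    return p**3 * (1 - p)**7

def log_likelihood(p):
    """Natural log of the likelihood: 3 ln(p) + 7 ln(1 - p)."""
    return 3 * log(p) + 7 * log(1 - p)

# Candidate values p = .001, .002, ..., .999 (an illustrative grid, not from the text)
grid = [k / 1000 for k in range(1, 1000)]

p_hat = max(grid, key=likelihood)            # maximizes L(p)
p_hat_log = max(grid, key=log_likelihood)    # maximizes ln[L(p)]
# The binomial coefficient rescales every height equally, so the maximizer is unchanged
p_hat_binom = max(grid, key=lambda p: comb(10, 3) * likelihood(p))

print(p_hat, p_hat_log, p_hat_binom)
print(round(likelihood(0.25), 6), round(likelihood(0.50), 6))
```

All three maximizers should agree with the calculus answer, up to the resolution of the grid.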

The likelihood function tells us how likely the observed sample is as a function 
of the possible parameter values. Maximizing the likelihood gives the parameter 
value for which the observed sample is most likely to have been generated, that is, 
the parameter value that “agrees most closely” with the observed data. Maximizing
the likelihood is equivalent to maximizing the logarithm of the likelihood, and the 
latter is typically computationally more straightforward. 


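Example 5.7 Suppose $X_1, \ldots, X_n$ is a random sample from an exponential
distribution with parameter $\lambda$. Because of independence, the likelihood function is
a product of the individual pdfs:

$$f(x_1, \ldots, x_n; \lambda) = (\lambda e^{-\lambda x_1}) \cdots (\lambda e^{-\lambda x_n}) = \lambda^n e^{-\lambda \sum x_i}$$

The log of the likelihood function is

$$\ln[L(\lambda)] = n\ln(\lambda) - \lambda \sum x_i$$

Equating $(d/d\lambda)\ln[L(\lambda)]$ to zero results in $n/\lambda - \sum x_i = 0$, or $\lambda = n/\sum x_i = 1/\bar{x}$.
Thus the mle is $\hat{\lambda} = 1/\bar{X}$. As we saw in Example 5.5, $\hat{\lambda}$ is unfortunately not an
unbiased estimator, since $E(1/\bar{X}) \ne 1/E(\bar{X})$. ■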
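
As a quick check on this derivation, the short Python sketch below computes the closed-form mle for a small sample and compares the log-likelihood there with its value at two nearby choices of $\lambda$. The five observations are hypothetical numbers made up for illustration, and the function names are ours.

```python
import math

def exp_log_likelihood(lam, xs):
    """ln[L(lambda)] = n ln(lambda) - lambda * sum(x_i) for an exponential sample."""
    return len(xs) * math.log(lam) - lam * sum(xs)

def exp_mle(xs):
    """Closed-form mle from Example 5.7: lambda-hat = n / sum(x_i) = 1 / x-bar."""
    return len(xs) / sum(xs)

# Hypothetical observations (made up for illustration, not data from the text)
xs = [0.8, 2.1, 0.3, 1.7, 0.6]
lam_hat = exp_mle(xs)

# The log-likelihood should be no larger at nearby values of lambda
for lam in (0.9 * lam_hat, lam_hat, 1.1 * lam_hat):
    print(round(lam, 4), round(exp_log_likelihood(lam, xs), 4))
```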
Example 5.8 In Chap. 2, we indicated that the Poisson distribution could be used
for modeling the number of events of some sort that occur in a two-dimensional
region (e.g., the occurrence of tornadoes during a particular time period). Assume
that when the region R being sampled has area a(R), the number X of events
occurring in R has a Poisson distribution with mean $\lambda \cdot a(R)$, where $\lambda$ is the expected
number of events per unit area, and that nonoverlapping regions yield independent
Xs. (This is called a spatial Poisson process.)

Suppose an ecologist selects n nonoverlapping regions $R_1, \ldots, R_n$ and counts the
number of plants of a certain species found in each region. The joint pmf (likelihood)
is then

$$p(x_1, \ldots, x_n; \lambda) = \frac{[\lambda \cdot a(R_1)]^{x_1} e^{-\lambda \cdot a(R_1)}}{x_1!} \cdots \frac{[\lambda \cdot a(R_n)]^{x_n} e^{-\lambda \cdot a(R_n)}}{x_n!} = \frac{[a(R_1)]^{x_1} \cdots [a(R_n)]^{x_n} \cdot \lambda^{\sum x_i} \cdot e^{-\lambda \sum a(R_i)}}{x_1! \cdots x_n!}$$

The log-likelihood is

$$\ln[L(\lambda)] = \sum \{x_i \ln[a(R_i)]\} + \ln(\lambda) \cdot \sum x_i - \lambda \sum a(R_i) - \sum \ln(x_i!)$$

Taking $(d/d\lambda)\ln[L(\lambda)]$ and equating it to zero yields

$$\frac{\sum x_i}{\lambda} - \sum a(R_i) = 0 \quad \Rightarrow \quad \lambda = \frac{\sum x_i}{\sum a(R_i)}$$

The mle is then $\hat{\lambda} = \sum X_i / \sum a(R_i)$. This is intuitively reasonable because $\lambda$ is
the true density (plants per unit area), whereas $\hat{\lambda}$ is the sample density: $\sum X_i$ is the
number of plants counted, and $\sum a(R_i)$ is just the total area sampled. Because
$E(X_i) = \lambda \cdot a(R_i)$, the estimator is unbiased.

Sometimes an alternative sampling procedure is used. Instead of fixing regions
to be sampled, the ecologist will select n points in the entire region of interest and
let $y_i$ = the distance from the ith point to the nearest plant. The cdf of Y = distance to
the nearest plant is

$$F_Y(y) = P(Y \le y) = 1 - P(Y > y) = 1 - P(\text{no plants in a circle of radius } y) = 1 - \frac{e^{-\lambda \pi y^2}(\lambda \pi y^2)^0}{0!} = 1 - e^{-\lambda \pi y^2}$$

Taking the derivative of $F_Y(y)$ with respect to y yields

$$f_Y(y; \lambda) = \begin{cases} 2\pi\lambda y e^{-\lambda \pi y^2} & y \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

If we now form the likelihood $L(\lambda) = f_Y(y_1; \lambda) \cdots f_Y(y_n; \lambda)$, differentiate $\ln[L(\lambda)]$,
and so on, the resulting mle is

$$\hat{\lambda} = \frac{n}{\pi \sum Y_i^2} = \frac{\text{number of plants observed}}{\text{total area sampled}}$$

which is also a sample plant density. It can be shown that in a sparse environment
(small $\lambda$), the distance method is in a certain sense better, whereas in a dense
environment, the first sampling method is better. ■
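
Both density estimators from this example are simple enough to compute directly. The Python sketch below implements the two formulas; the counts, areas, and distances are hypothetical values invented for illustration, and the function names are ours rather than anything from the text.

```python
import math

def density_mle_from_counts(counts, areas):
    """Region method: lambda-hat = (total plants counted) / (total area sampled)."""
    return sum(counts) / sum(areas)

def density_mle_from_distances(distances):
    """Distance method: lambda-hat = n / (pi * sum of squared nearest-plant distances)."""
    return len(distances) / (math.pi * sum(y * y for y in distances))

# Hypothetical field data (not from the text)
counts = [3, 5, 2]             # plants found in each of three regions
areas = [1.0, 2.0, 1.0]        # areas a(R_i) of those regions
distances = [0.5, 0.5]         # distance from each chosen point to its nearest plant

print(density_mle_from_counts(counts, areas))
print(density_mle_from_distances(distances))
```

In the distance method each circle of radius $y_i$ contributes area $\pi y_i^2$ and exactly one observed plant, which is why the denominator can be read as the total area sampled.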


The definition of maximum likelihood estimates can be extended in the natural 
way to distributional families that include two or more parameters. The mles of
parameters $\theta_1, \ldots, \theta_m$ are those values $\hat{\theta}_1, \ldots, \hat{\theta}_m$ that maximize the likelihood
function $L(\theta_1, \ldots, \theta_m)$.


Example 5.9 Let $X_1, \ldots, X_n$ be a random sample from a normal distribution, which
includes the two parameters $\mu$ and $\sigma$. The likelihood function is

$$f(x_1, \ldots, x_n; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x_1-\mu)^2/(2\sigma^2)} \cdots \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x_n-\mu)^2/(2\sigma^2)}$$

so

$$\ln[L(\mu, \sigma)] = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{1}{2\sigma^2}\sum (x_i - \mu)^2$$

To find the maximizing values of $\mu$ and $\sigma$, we must take the partial derivatives of
ln(L) with respect to both $\mu$ and $\sigma$, equate them to zero, and solve the resulting two
equations:

$$\frac{\partial}{\partial\mu}\ln[L(\mu, \sigma)] = -\frac{1}{2\sigma^2}\sum 2(x_i - \mu)(-1) = \frac{1}{\sigma^2}\sum (x_i - \mu) = 0$$

$$\frac{\partial}{\partial\sigma}\ln[L(\mu, \sigma)] = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum (x_i - \mu)^2 = 0$$

The first equation implies that $\sum (x_i - \mu) = 0$, from which $\sum x_i - n\mu = 0$ and
finally $\mu = \sum x_i / n = \bar{x}$. The mle of $\mu$ is the sample mean, independent of what the
mle of $\sigma$ turns out to be. Solving the second equation for $\sigma$ yields
$\sigma = \sqrt{\sum (x_i - \mu)^2 / n}$; we must then substitute the solution from the first equation
into this expression in order to get the simultaneous solution to the two partial
derivative equations. Thus the maximum likelihood estimators of the two
parameters are

$$\hat{\mu} = \bar{X} \qquad \hat{\sigma} = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}}$$

Notice that the mle of $\sigma$ is not the sample standard deviation, S, since the
denominator in the latter is $n - 1$ and not n. ■
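
The difference between $\hat{\sigma}$ and S is easy to see numerically. In the Python sketch below, the sample is hypothetical and the helper name `normal_mle` is ours. The standard library's `statistics.pstdev` uses the denominator n and `statistics.stdev` uses $n - 1$, so they serve as independent checks.

```python
import math
import statistics

def normal_mle(xs):
    """Return (mu-hat, sigma-hat) for a normal sample, using the denominator n."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / n)
    return mu_hat, sigma_hat

# Hypothetical measurements (made up for illustration)
xs = [10.2, 9.8, 11.5, 10.9, 9.6]
mu_hat, sigma_hat = normal_mle(xs)
s = statistics.stdev(xs)   # sample standard deviation S, denominator n - 1

print(round(mu_hat, 4), round(sigma_hat, 4), round(s, 4))
```

For any sample, $\hat{\sigma} = S\sqrt{(n-1)/n}$, so the mle is always slightly smaller than S.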


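Example 5.10 Let $X_1, \ldots, X_n$ be a random sample from a Weibull pdf

$$f(x; \alpha, \beta) = \begin{cases} \dfrac{\alpha}{\beta^\alpha} x^{\alpha-1} e^{-(x/\beta)^\alpha} & x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

Writing the likelihood L and log-likelihood ln[L], then setting both partial
derivatives $(\partial/\partial\alpha)[\ln(L)] = 0$ and $(\partial/\partial\beta)[\ln(L)] = 0$ yields the equations

$$\alpha = \left[\frac{\sum [x_i^\alpha \ln(x_i)]}{\sum x_i^\alpha} - \frac{\sum \ln(x_i)}{n}\right]^{-1} \qquad \beta = \left(\frac{\sum x_i^\alpha}{n}\right)^{1/\alpha}$$

These two equations cannot be solved explicitly to give general formulas for the
mles $\hat{\alpha}$ and $\hat{\beta}$. Instead, for each sample $x_1, \ldots, x_n$, the equations must be solved
using an iterative numerical procedure.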
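
One possible iterative procedure is sketched below in Python. It is our own illustration, not the algorithm used by any particular statistical package. It rewrites the first equation as $g(\alpha) = 0$, locates the root by bisection, and then substitutes into the second equation. The sketch assumes all observations are positive and not all equal; the simulated sample, the seed, and the "true" values 3 and 100 are arbitrary choices.

```python
import math
import random

def weibull_mle(xs, tol=1e-10):
    """Solve the two Weibull likelihood equations; returns (alpha-hat, beta-hat)."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / n
    x_max = max(xs)

    def g(alpha):
        # g(alpha) = sum(x^a ln x)/sum(x^a) - mean(ln x) - 1/alpha; its root is alpha-hat.
        # Dividing each x by x_max leaves the ratio unchanged and avoids overflow.
        w = [(x / x_max) ** alpha for x in xs]
        return sum(wi * li for wi, li in zip(w, logs)) / sum(w) - mean_log - 1 / alpha

    lo, hi = 1e-3, 1.0
    while g(hi) < 0:            # expand the bracket until the root lies in [lo, hi]
        hi *= 2
    while hi - lo > tol:        # bisection
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    alpha = (lo + hi) / 2
    beta = x_max * (sum((x / x_max) ** alpha for x in xs) / n) ** (1 / alpha)
    return alpha, beta

# Simulated data for illustration: shape alpha = 3, scale beta = 100.
# Note that random.weibullvariate takes the scale first and the shape second.
random.seed(1)
xs = [random.weibullvariate(100.0, 3.0) for _ in range(2000)]
alpha_hat, beta_hat = weibull_mle(xs)
print(round(alpha_hat, 3), round(beta_hat, 2))
```

Bisection is slower than Newton-type methods but needs no derivatives and cannot overshoot once the root is bracketed, which makes it a reasonable choice for a sketch.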

The iterative mle computations can be done using statistical software. In Matlab, 
the command wblfit(x) will return $\hat{\alpha}$ and $\hat{\beta}$ assuming the data is stored in the
vector x. The R command fitdistr(x, "weibull")
performs the same estimation (the MASS package must be installed first). As an
example, consider the following survival time data alluded to in Example 3.28: 


152 115 109 94 88 137 152 a, 160 165 
125 40 128 123 136 101 62 153 83 69 


A Weibull probability plot supports the plausibility of assuming that survival 
time has a Weibull distribution. The maximum likelihood estimates of the Weibull 
parameters are $\hat{\alpha} = 3.799$ and $\hat{\beta} = 125.88$. Figure 5.4 shows the Weibull log
likelihood as a function of both $\alpha$ and $\beta$. The surface near the top has a rounded
shape, allowing the maximum to be found easily, but for some distributions the
surface can be much more irregular, and the maximum may be hard to find.

Fig. 5.4 Weibull log likelihood for Example 5.10 ■


Sometimes calculus cannot be used to obtain mles. 


Example 5.11 Suppose the waiting time for a bus is uniformly distributed on $[0, \theta]$
and the results $x_1, \ldots, x_n$ of a random sample from this distribution have been
observed. Since $f(x; \theta) = 1/\theta$ for $0 \le x \le \theta$ and 0 otherwise,

$$f(x_1, \ldots, x_n; \theta) = \begin{cases} \dfrac{1}{\theta^n} & 0 \le x_1 \le \theta, \ldots, 0 \le x_n \le \theta \\ 0 & \text{otherwise} \end{cases}$$

As long as $\max(x_i) \le \theta$, the likelihood is $1/\theta^n$, which is positive, but as soon as
$\theta < \max(x_i)$, the likelihood drops to 0. This is illustrated in Fig. 5.5. Calculus will
not work because the maximum of the likelihood occurs at a point of discontinuity,
but the figure shows that $\hat{\theta} = \max(x_i)$. Thus if the waiting times are 2.3, 3.7, 1.5, .4,
and 3.2, then the mle is $\hat{\theta} = 3.7$. Note that the mle is biased (see Exercise 20(b)).

Fig. 5.5 The likelihood function for Example 5.11 (likelihood plotted against $\theta$, with $\max(x_i)$ marked on the horizontal axis) ■
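
The jump in the likelihood at $\max(x_i)$ can be seen by evaluating it on either side of that point. The Python sketch below uses the five waiting times from the example; the function name and the particular values of $\theta$ tried are our own choices.

```python
def uniform_likelihood(theta, xs):
    """Likelihood for a uniform [0, theta] sample: 1/theta^n if every x_i fits, else 0."""
    if theta <= 0 or min(xs) < 0 or max(xs) > theta:
        return 0.0
    return theta ** (-len(xs))

waits = [2.3, 3.7, 1.5, 0.4, 3.2]   # waiting times from Example 5.11
theta_hat = max(waits)               # the mle is the sample maximum

for theta in (3.6, 3.7, 3.8, 5.0):
    print(theta, uniform_likelihood(theta, waits))
```

The likelihood is 0 just below the sample maximum, largest at it, and strictly decreasing beyond it.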
5.2.1 Some Properties of MLEs


In Example 5.9, we obtained the mle of $\sigma$ when the underlying distribution is
normal. The mle of $\sigma^2$, as well as many other mles, can be easily derived using the
following proposition.


PROPOSITION (MLE INVARIANCE PRINCIPLE) 


Let $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_m$ be the mles of the parameters $\theta_1, \theta_2, \ldots, \theta_m$. Then the mle
of any function $h(\theta_1, \theta_2, \ldots, \theta_m)$ of these parameters is the function
$h(\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_m)$ of the mles.


Proof For an intuitive idea of the proof, consider the special case m = 1, with
$\theta_1 = \theta$, and assume that h(·) is a one-to-one function. On the graph of the likelihood
as a function of the parameter $\theta$, the highest point occurs where $\theta = \hat{\theta}$. Now
consider the graph of the likelihood as a function of $h(\theta)$. In the new graph the
same heights occur, but the height that was previously plotted at $\theta = a$ is now
plotted at $h(\theta) = h(a)$, and the highest point is now plotted at $h(\theta) = h(\hat{\theta})$. Thus, the
maximum remains the same, but it now occurs at $h(\hat{\theta})$. ■


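Example 5.12 (Example 5.9 continued) In the case of a random sample from a
normal pdf, the mles of $\mu$ and $\sigma$ are $\hat{\mu} = \bar{X}$ and $\hat{\sigma} = \sqrt{\sum (X_i - \bar{X})^2 / n}$. To obtain
the mle of the function $h(\mu, \sigma) = \sigma^2$, substitute the mles into the function:

$$\hat{\sigma}^2 = h(\hat{\mu}, \hat{\sigma}) = \frac{1}{n}\sum (X_i - \bar{X})^2$$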


The mle of $\sigma^2$ is not the unbiased estimator (the sample variance $S^2$; see Exercise
10), although they are close unless n is quite small. Similarly, the mle of the
population coefficient of variation, $h(\mu, \sigma) = 100 \cdot \sigma/\mu$, is $100 \cdot \hat{\sigma}/\hat{\mu}$. ■


Example 5.13 (Example 5.10 continued) The mean value of an rv X that has a
Weibull distribution is

$$\mu = \beta \cdot \Gamma(1 + 1/\alpha)$$

The mle of $\mu$ is therefore $\hat{\mu} = \hat{\beta} \cdot \Gamma(1 + 1/\hat{\alpha})$, where $\hat{\alpha}$ and $\hat{\beta}$ are the mles of $\alpha$
and $\beta$. In particular, $\bar{X}$ is not the mle of $\mu$, although it is an unbiased estimator. At
least for large n, $\hat{\mu}$ is a better estimator than $\bar{X}$. ■
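
The invariance principle reduces this calculation to a single function evaluation. As a small illustration, the Python sketch below plugs the estimates reported in Example 5.10 into the Weibull mean formula; the function name is ours.

```python
import math

def weibull_mean(alpha, beta):
    """Mean of a Weibull distribution: beta * Gamma(1 + 1/alpha)."""
    return beta * math.gamma(1 + 1 / alpha)

# mles reported in Example 5.10
alpha_hat, beta_hat = 3.799, 125.88

# Invariance principle: the mle of mu is the same function evaluated at the mles
mu_hat = weibull_mean(alpha_hat, beta_hat)
print(round(mu_hat, 2))
```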


The method of maximum likelihood estimation has considerable intuitive 
appeal. The following proposition provides additional rationale for the use of mles.

PROPOSITION

Under very general conditions on the joint distribution of the sample, when the
sample size is large, the maximum likelihood estimator $\hat{\theta}$ of any particular $\theta$

• Is highly likely to be close to $\theta$ (consistency);

• Is either unbiased or at least approximately unbiased [$E(\hat{\theta}) \approx \theta$]; and

• Has variance that is either as small or nearly as small as can be achieved by
any unbiased estimator.
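
A small simulation can make the first two properties concrete. The Python sketch below is our own illustration: the true value $\lambda = 2$, the sample sizes, the number of replications, and the seed are all arbitrary choices. It repeatedly computes the exponential mle $\hat{\lambda} = 1/\bar{X}$ of Example 5.7 and records the average and standard deviation of the estimates for several sample sizes.

```python
import random
import statistics

def exp_mle(xs):
    """mle of the exponential parameter: n / sum(x_i)."""
    return len(xs) / sum(xs)

def simulate_mle(lam, n, reps, rng):
    """Return `reps` simulated values of lambda-hat, each from a sample of size n."""
    return [exp_mle([rng.expovariate(lam) for _ in range(n)]) for _ in range(reps)]

rng = random.Random(2024)   # fixed seed so the run is reproducible
true_lam = 2.0              # hypothetical true parameter value
results = {}
for n in (5, 50, 500):
    estimates = simulate_mle(true_lam, n, reps=2000, rng=rng)
    results[n] = (statistics.fmean(estimates), statistics.stdev(estimates))
    print(n, round(results[n][0], 3), round(results[n][1], 3))
```

For small n the average estimate should sit noticeably above $\lambda$, reflecting the bias noted in Example 5.7, while for large n both the bias and the spread should shrink.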


Because of this result and the fact that calculus-based techniques can usually be 
used to derive the mles (although often numerical methods, such as Newton— 
Raphson, are necessary), maximum likelihood estimation is the most widely used 
estimation technique among statisticians. Obtaining an mle, however, does require 
that the underlying distribution be specified. For example, the mle of the mean 
value of a Weibull distribution is different from the mle of the mean value of a 
Gamma distribution. 

Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a pdf $f(x; \theta)$ that is symmetric
about $\theta$, but the investigator is unsure of the form of the f function. It is then
desirable to use an estimator that is robust, that is, one that performs well for a wide
variety of underlying pdfs. One such estimator, called an M-estimator, is based on a
generalization of maximum likelihood estimation. Instead of maximizing the
log-likelihood $\sum \ln[f(x_i; \theta)]$ for a specified f, one maximizes $\sum \rho(x_i; \theta)$, where
the “objective function” $\rho$ is selected to yield an estimator with good robustness
properties. The book by David Hoaglin et al. (see the references) contains a good
exposition on this subject.
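
To give a flavor of how an M-estimator behaves, the Python sketch below uses one common choice of objective function, the negative of Huber's loss. This choice, the tuning constant k = 1.5, the data, and the ternary-search maximizer are all our own assumptions for illustration; the text does not specify any of them.

```python
def huber_rho(x, theta, k=1.5):
    """Negative Huber loss: quadratic for small residuals, linear for large ones."""
    r = abs(x - theta)
    loss = 0.5 * r * r if r <= k else k * (r - 0.5 * k)
    return -loss

def m_estimate(xs, k=1.5, tol=1e-8):
    """Maximize sum(rho(x_i; theta)) over theta by ternary search (objective is concave)."""
    def objective(theta):
        return sum(huber_rho(x, theta, k) for x in xs)

    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Hypothetical sample with one wild observation
xs = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]
print(round(sum(xs) / len(xs), 3), round(m_estimate(xs), 3))
```

The sample mean is pulled strongly toward the outlying value, whereas the M-estimate stays close to the bulk of the data, which is the sense in which such an estimator is robust.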


5.2.2 Exercises: Section 5.2 (24-36) 


24. Let X represent the error in making a measurement of a physical characteristic 
or property (e.g., the boiling point of a particular liquid). It is often reasonable 
to assume that E(X) =0 and that X has a normal distribution. Thus the pdf of 
any particular measurement error is 


$$f(x; \theta) = \frac{1}{\sqrt{2\pi\theta}} e^{-x^2/(2\theta)}$$

where $\theta$ denotes the population variance. Now suppose that n independent
measurements are made, resulting in measurement errors $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$.

(a) Determine the likelihood function of $\theta$.

(b) Find and simplify the log-likelihood function.

(c) Differentiate (b) to determine the mle of $\theta$.

(d) The precision of a normal distribution is defined to be $\tau = 1/\theta$. Find the
mle of $\tau$.

25. A random sample of n bike helmets manufactured by a company is selected.
Let X = the number among the n that are flawed, and let p = P(flawed).
Assume that only X is observed, rather than the sequence of Ss and Fs.

(a) Derive the maximum likelihood estimator of p. If n = 20 and x = 3, what is
the estimate?

(b) Is the estimator of (a) unbiased?

(c) If n = 20 and x = 3, what is the mle of the probability $(1-p)^5$ that none of
the next five helmets examined is flawed?

26. Let X denote the proportion of allotted time that a randomly selected student
spends working on a certain aptitude test. Suppose the pdf of X is

$$f(x; \theta) = \begin{cases} (\theta + 1)x^\theta & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

where $-1 < \theta$. A random sample of ten students yields data $x_1 = .92$, $x_2 = .79$,
$x_3 = .90$, $x_4 = .65$, $x_5 = .86$, $x_6 = .47$, $x_7 = .73$, $x_8 = .97$, $x_9 = .94$, $x_{10} = .77$.
Obtain the maximum likelihood estimator of $\theta$, and then compute the
estimate for the given data.

27. Two different computer systems are monitored for a total of n weeks. Let $X_i$
denote the number of breakdowns of the first system during the ith week, and
suppose the $X_i$s are independent and drawn from a Poisson distribution with
parameter $\mu_1$. Similarly, let $Y_i$ denote the number of breakdowns of the second
system during the ith week, and assume independence with each $Y_i$ Poisson
with parameter $\mu_2$. Derive the mles of $\mu_1$, $\mu_2$, and $\mu_1 - \mu_2$. [Hint: Using
independence, write the joint pmf (likelihood) of the $X_i$s and $Y_i$s together.]

28. Six Pepperidge Farm bagels were weighed, yielding the following data
(grams):

117.6 109.5 111.6 109.2 119.1 110.8

(a) Assuming that the six bagels are a random sample and that weights are
normally distributed, estimate the true average weight and standard deviation
of the weight using maximum likelihood.

(b) Again assuming a normal distribution, estimate the weight below which
95% of all bagels will have their weights. [Hint: What is the 95th
percentile in terms of $\mu$ and $\sigma$? Now use the invariance principle.]

(c) Suppose we choose another bagel and weigh it. Let X = weight of the
bagel. Use the given data to obtain the mle of $P(X \le 113.4)$. [Hint:
$P(X \le 113.4) = \Phi[(113.4 - \mu)/\sigma]$.]

29. Refer to Exercise 25. Instead of selecting n = 20 helmets to examine, suppose
we examine helmets in succession until we have found r = 3 flawed ones. If
the 20th helmet is the third flawed one, what is the mle of p? Is this the same as
the estimate in Exercise 25? Why or why not?
30. Let $X_1, \ldots, X_n$ be a random sample from a gamma distribution with
parameters $\alpha$ and $\beta$.

(a) Derive the equations whose solution yields the maximum likelihood
estimators of $\alpha$ and $\beta$. Do you think they can be solved explicitly?

(b) Show that the mle of $\mu = \alpha\beta$ is $\hat{\mu} = \bar{X}$.

31. Let $X_1, X_2, \ldots, X_n$ represent a random sample from the Rayleigh distribution
with density function given in Exercise 17.

(a) Determine the maximum likelihood estimator of $\theta$ and then calculate the
estimate for the vibratory stress data given in that exercise. Is this estimator
the same as the unbiased estimator suggested in Exercise 17?

(b) Determine the mle of the median of the vibratory stress distribution.
[Hint: First express the median $\eta$ in terms of $\theta$.]

32. Consider a random sample $X_1, X_2, \ldots, X_n$ from the shifted exponential pdf

$$f(x; \lambda, \theta) = \begin{cases} \lambda e^{-\lambda(x-\theta)} & x \ge \theta \\ 0 & \text{otherwise} \end{cases}$$

Taking $\theta = 0$ gives the pdf of the exponential distribution considered previously
(with positive density to the right of zero). An example of the shifted
exponential distribution appeared in Example 3.5, in which the variable of
interest was time headway in traffic flow and $\theta = .5$ was the minimum possible
time headway.

(a) Obtain the maximum likelihood estimators of $\theta$ and $\lambda$.

(b) If n = 10 time headway observations are made, resulting in the values
3.11, .64, 2.55, 2.20, 5.44, 3.42, 10.39, 8.93, 17.82, and 1.30, calculate the
estimates of $\theta$ and $\lambda$.

33. The article “A Model of Pedestrians’ Waiting Times for Street Crossings at
Signalized Intersections” (Transportation Research, 2013: 17-28) suggested
that under some circumstances the distribution of waiting time X could be
modeled with the following pdf:

$$f(x; \theta, \tau) = \begin{cases} \dfrac{\theta}{\tau}(1 - x/\tau)^{\theta-1} & 0 < x < \tau \\ 0 & \text{otherwise} \end{cases} \qquad \text{where } \theta > 0$$

(a) Suppose we observe a random sample of waiting times $X_1, \ldots, X_n$ and
assume that the value of the parameter $\tau$ is known. Find the mle of $\theta$.

(b) Suppose instead that $\theta$ is known but $\tau$ is unknown. Determine an equation
whose solution is the mle of $\tau$.

34. Twenty identical components are simultaneously tested. The lifetime distribution
of each is exponential with parameter $\lambda$. The experimenter then leaves
the test facility unmonitored. On his return 24 h later, the experimenter
immediately terminates the test after noticing that y = 15 of the 20 components
are still in operation (so 5 have failed). Derive the mle of $\lambda$. [Hint: Let Y = the
number that survive 24 h. Then $Y \sim \text{Bin}(n, p)$. What is the mle of p? Now
notice that $p = P(X_i \ge 24)$, where $X_i$ is exponentially distributed. This relates $\lambda$
to p, so the former can be estimated once the latter has been.]

35. Consider randomly selecting n segments of pipe and determining the corro- 
sion loss (mm) in the wall thickness for each one. Denote these corrosion 
losses by $Y_1, \ldots, Y_n$. The article “A Probabilistic Model for A Gas Explosion
Due to Leakages in the Grey Cast Iron Gas Mains” (Reliability Engr. and
System Safety 2013:270-279) proposes a linear corrosion model $Y_i = t_i R_i$,
where $t_i$ is the age of the pipe and $R_i$, the corrosion rate, is exponentially
distributed with parameter $\lambda$. Obtain the maximum likelihood estimator of $\lambda$
(the resulting mle appears in the cited article). [Hint: First determine the pdf of
$Y_i$.]

36. A method that is often used to estimate the size of a wildlife population 
involves performing a capture/recapture experiment. In this experiment, an 
initial sample of M animals is captured, each of these animals is tagged, and 
the animals are then returned to the population. After allowing enough time 
for the tagged individuals to mix into the population, another sample of size 
n is captured. With X = the number of tagged animals in the second sample,
the objective is to use the observed x to estimate the population size N.

(a) What is the probability distribution of X?

(b) Set L(N) equal to the distribution specified in (a); this is the likelihood
function. Since N can only assume integer values, using calculus to
maximize L(N) would present difficulties. Instead, determine the mle of
N by considering the ratio L(N)/L(N − 1). [Hint: the mle can be found by
determining when this ratio is greater than 1 or less than 1 (do you see
why?).]

(c) If 200 fish are taken from a lake and tagged, then subsequently 100 fish are
recaptured and among the 100 there are 11 tagged fish, what is the mle of
the size of the fish population in this lake? Does your answer make
intuitive sense?

5.3 Confidence Intervals for a Population Mean 


A point estimate, because it is a single number, by itself provides no information 
about the precision and reliability of estimation. Consider, for example, using the 
statistic $\bar{X}$ to calculate a point estimate for the true average breaking strength of
paper towels of a certain brand, and suppose that a particular random sample yields
$\bar{x} = 9322.7$ g. Because of sampling variability, it is virtually never the case that
$\bar{x} = \mu$, and the point estimate alone says nothing about how close it might be to $\mu$.
An alternative to reporting a single sensible value for the parameter being estimated
is to calculate and report an entire interval of plausible values: an interval estimate
or confidence interval (CI).

A confidence interval is always calculated by first selecting a confidence level,
which is a measure of the degree of reliability of the interval. A confidence interval
with a 95% confidence level for the true average breaking strength might have a
lower limit of 9162.5 g and an upper limit of 9482.9 g. Then at the “95% confidence
level,” any value of $\mu$ between 9162.5 and 9482.9 is plausible. A confidence level of
95% implies that 95% of all samples would give an interval that includes $\mu$, or
whatever other parameter is being estimated, and only 5% of all samples would
yield an erroneous interval. The most frequently used confidence levels are 95, 99,
and 90%. The higher the confidence level, the more strongly we believe that the
value of the parameter being estimated lies within the interval.
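
The repeated-sampling interpretation of a confidence level can be illustrated by simulation. The Python sketch below is our own and relies on an assumption the text has not yet developed at this point: it uses the known-$\sigma$ interval $\bar{x} \pm 1.96\sigma/\sqrt{n}$ for a normal population. The population values, sample size, number of replications, and seed are hypothetical.

```python
import math
import random

def z_interval(xs, sigma, z=1.96):
    """Interval x-bar +/- z * sigma / sqrt(n), assuming sigma is known."""
    n = len(xs)
    xbar = sum(xs) / n
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

rng = random.Random(7)                        # fixed seed for reproducibility
mu, sigma, n, reps = 100.0, 15.0, 25, 5000    # hypothetical population and design
hits = 0
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    lower, upper = z_interval(sample, sigma)
    if lower <= mu <= upper:                  # does this interval include mu?
        hits += 1
coverage = hits / reps
print(coverage)
```

The printed proportion should be close to .95: roughly 95% of the simulated samples yield an interval that includes $\mu$, and the rest yield an erroneous interval.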

Information about the precision of an interval estimate is conveyed by the width 
of the interval. If the confidence level is high and the resulting interval is quite 
narrow, our knowledge of the value of the parameter is reasonably precise. A very 
wide confidence interval, however, gives the message that there is a great deal of 
uncertainty concerning the value of what we are estimating. Figure 5.6 shows 95% 
confidence intervals for true average breaking strengths of two different brands of 
paper towels. One of these intervals suggests precise knowledge about $\mu$, whereas
the other suggests a very wide range of plausible values.


Fig. 5.6 Confidence intervals indicating precise (brand 1) and imprecise (brand 2) information
about $\mu$


5.3.1 A Confidence Interval for a Normal Population Mean


For much of this section we will assume that the available data results from a 
random sample $X_1, X_2, \ldots, X_n$ selected from a normal population distribution. The
plausibility of assuming a normal population distribution can of course be checked
by examining a normal probability plot of the data. Particularly when the sample
size is small, the confidence interval to be developed here should not be used if the
plot shows a substantial departure from a linear pattern. We'll comment later on
what might be done in the presence of non-normality.

Recall from the previous chapter that as a consequence of our normality
assumption, the sample mean $\bar{X}$ also is normally distributed, with mean value $\mu$
(the mean of the population from which the sample was selected) and standard
deviation $\sigma/\sqrt{n}$. We now standardize $\bar{X}$ to obtain a random variable having a
standard normal distribution:

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

Unfortunately this standardized variable cannot serve as a basis for deriving a
confidence interval for $\mu$ unless the value of the population standard deviation $\sigma$
happens to be known. So instead let’s consider the standardized variable obtained
by replacing $\sigma$ in Z by the sample standard deviation S. Define a new random
variable T by

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

It is important to contrast the behavior of Z in repeated sampling with that of T.
The only variability in Z from one sample to another is because $\bar{X}$ in the numerator
varies in value. However, there are two sources of sample-to-sample variability in
T: both $\bar{X}$ in the numerator and S in the denominator. Because of this extra variation
in T, it stands to reason that the distribution of T should be more spread out than that
of Z. That is, the density curve for T should be more spread out than the standard
normal curve.
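
The extra variability in T can be observed directly by simulation. In the Python sketch below, which is our own illustration, the population values, the sample size n = 6, the number of replications, and the seed are arbitrary. Both Z and T are computed from the same simulated samples, and their spreads and tail frequencies are compared.

```python
import math
import random
import statistics

rng = random.Random(11)                      # fixed seed for reproducibility
mu, sigma, n, reps = 50.0, 4.0, 6, 20000     # hypothetical values
z_values, t_values = [], []
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)             # sample standard deviation (n - 1 denominator)
    z_values.append((xbar - mu) / (sigma / math.sqrt(n)))
    t_values.append((xbar - mu) / (s / math.sqrt(n)))

sd_z = statistics.stdev(z_values)
sd_t = statistics.stdev(t_values)
# Proportion of simulated values falling outside (-1.96, 1.96)
tail_z = sum(abs(z) > 1.96 for z in z_values) / reps
tail_t = sum(abs(t) > 1.96 for t in t_values) / reps
print(round(sd_z, 3), round(sd_t, 3), tail_z, tail_t)
```

The simulated T values should show both a larger standard deviation and a larger proportion beyond ±1.96 than the Z values, consistent with a more spread-out density curve.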

At this point we need to introduce a new (to the reader) family of probability 
distributions that describes how T varies from one sample to another. This is the 
family of t distributions. The formula for the density function that specifies a
t distribution is quite complicated (see the reference by Devore and Berk, where the 
formula and a derivation appear). Fortunately for our purpose we need only be 
acquainted with some general properties. 


PROPERTIES OF T DISTRIBUTIONS 


1. Any particular t distribution is obtained by specifying the value of a single
parameter $\nu$, called the number of degrees of freedom (df) of the distribution.
Any positive integer is a possible value of $\nu$, so there is a t distribution
with 1 df, another with 2 df, and so on.

2. Each $t_\nu$ density curve is bell shaped and centered at 0, just like the standard
normal (z) curve.

3. Each $t_\nu$ density curve is more spread out than the z curve.

4. As $\nu$ increases, the spread of the $t_\nu$ curve decreases (so the $t_1$ curve is the
most spread out, the $t_2$ curve is next most spread out, and so on).

5. As $\nu \to \infty$, the sequence of $t_\nu$ curves approaches the z curve (for this
reason, the z curve is often called the t curve with df = $\infty$).


Figure 5.7 shows several different t density curves and the z curve to illustrate
how the curves compare and change as df increases.

Fig. 5.7 Comparison of several t curves and the z curve

Appendix Table A.5 displays what are called t critical values; these are numbers
on the horizontal axis that capture certain central areas under t curves. For example,
looking down the left column to ν = 24 and then over to the column headed 95%,
we learn that 95% of the area under the t curve with 24 df lies between −2.064 and
2.064. Notice that in any particular column of the table, the numbers decrease as we
move down; this is because the spread of t curves decreases as df increases. And the
numbers in any row increase from left to right because a larger central area is being
captured. Also note that toward the bottom of the table df skips from 30 to 40 to
60 to 120 to ∞. Once past 30 df, the t curves do not change all that much, so it is not
worth continuing to tabulate in increments of 1 df. For an intermediate number of
df, linear interpolation can be used to get a reasonable approximation, or appropriate
software will produce an exact value. Lastly, the t distribution with an infinite
number of df is actually the standard normal distribution. Thus the bottom row of
the table contains standard normal critical values; for example, 95% of the area
under the z curve lies between −1.96 and 1.96.
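The linear interpolation mentioned above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the two tabulated 95% critical values used (2.021 at 40 df and 2.000 at 60 df) are standard t-table entries that are not quoted in the text itself.

```python
def interpolate_critical_value(df, df_low, t_low, df_high, t_high):
    """Linearly interpolate a critical value between two tabulated df rows."""
    weight = (df - df_low) / (df_high - df_low)
    return t_low + weight * (t_high - t_low)

# Assumed table entries: 95% critical values at 40 df and 60 df.
approx_50_df = interpolate_critical_value(50, 40, 2.021, 60, 2.000)
print(round(approx_50_df, 4))
```

The interpolated value for 50 df lands halfway between the two table entries; software would return a slightly different exact value.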

With information about t distributions in hand, we are now ready for the key
theoretical result on which our confidence interval will be based. This result was 
originally discovered in 1908 by William Sealy Gosset, a statistician at the 
Guinness Brewery in Dublin, Ireland. 


GOSSET’S THEOREM

Let X₁, …, Xₙ be a random sample from a normal population distribution
having mean μ, with corresponding sample mean X̄ and sample standard
deviation S. Then the random variable


$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$


has a t distribution with n − 1 degrees of freedom.


An intuitive justification for degrees of freedom here is that although there are
n deviations X₁ − X̄, X₂ − X̄, …, Xₙ − X̄ from the sample mean, it is easily verified
that Σ(Xᵢ − X̄) = 0. This implies that any particular one of the deviations can be
obtained from the other n − 1 deviations. For example, in the case n = 5, if the first
four deviations are −2, 5, 1, and −8, then the last one must be 4 to produce a sum
of zero. The number of df here is the number of “freely varying” deviations that
are inputs to the sample standard deviation.

Consider for the moment a sample size of n = 25, for which the standardized
variable T is based on 24 df. Then 95% of the area under this t curve lies between
−2.064 and 2.064. The foregoing theorem then allows us to make the following
probability statement:


$$P\left(-2.064 < \frac{\bar{X} - \mu}{S/\sqrt{25}} < 2.064\right) = .95$$


Let’s now manipulate the inequalities inside the parentheses to isolate μ in the
middle. This requires three steps: (1) multiply all three terms by S/√n, (2) subtract
X̄ from all three terms, and (3) multiply through by −1 to eliminate the negative
sign in front of μ. The last step will reverse the direction of each inequality,
resulting in


$$\bar{X} - 2.064\frac{S}{\sqrt{25}} < \mu < \bar{X} + 2.064\frac{S}{\sqrt{25}}$$


These new inequalities are completely equivalent to those in the original probability
statement, so


$$P\left(\bar{X} - 2.064\frac{S}{\sqrt{25}} < \mu < \bar{X} + 2.064\frac{S}{\sqrt{25}}\right) = .95$$


To interpret this latter probability, think of obtaining sample after sample of size
25 from a normal population distribution; calculate the sample mean and sample
standard deviation for each one, and then form the lower limit x̄ − 2.064s/√25 and
the upper limit x̄ + 2.064s/√25. Both the center of the interval (x̄) and its width will
vary from sample to sample. In the long run, 95% of such samples will result in the
value of μ being captured between the lower limit and the upper limit; that is, the
long-run capture percentage for the sequence of intervals is 95%. Any particular one of
these intervals is called a confidence interval for μ with confidence level 95%.
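The long-run capture interpretation can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than part of the derivation: the population values μ = 50 and σ = 10, the number of replications, the seed, and the function name are arbitrary choices, while n = 25 and the critical value 2.064 come from the discussion above.

```python
import random
import statistics
from math import sqrt

T_STAR_24_DF = 2.064   # 95% t critical value for 24 df, as quoted in the text

def capture_fraction(mu=50.0, sigma=10.0, n=25, reps=10000, seed=2):
    """Fraction of simulated 95% t intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        half_width = T_STAR_24_DF * statistics.stdev(sample) / sqrt(n)
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / reps

print("Long-run capture fraction:", capture_fraction())
```

The printed fraction should be close to .95, although any single interval either contains μ or does not.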

Generalizing the foregoing derivation for an arbitrary sample size leads to the 
following confidence interval formula. 


ONE-SAMPLE T CONFIDENCE INTERVAL

Let x̄ and s be the sample mean and sample standard deviation of a random
sample of size n selected from a normal population distribution. Then a
confidence interval (interval of plausible values) for the population mean μ
has endpoints


$$\bar{x} \pm t^* \cdot \frac{s}{\sqrt{n}}$$


where t* is the appropriate t critical value with n − 1 df from Table A.5.

Example 5.14 Have you ever dreamed of owning a Porsche? Even though academic
salaries leave little room for luxuries, the authors thought maybe the purchase
of a used Boxster, the least expensive Porsche model, might be feasible. So
on December 30, 2012 we went to www.cars.com to peruse prices. The news was
discouraging, so we instead selected a random sample of 16 such vehicles and
obtained the following odometer readings (miles):


1445 25,822 26,892 29,860 35,285 47,874 49,544 64,763 
72,698 75,732 84,457 91,577 93,000 109,538 113,399 137,652 


Figure 5.8 shows a normal probability plot of the data; this version includes a 
superimposed line which makes it easier to judge whether the pattern in the plot is 
reasonably linear. Very clearly that is the case. It is therefore quite plausible that the 
distribution of odometer readings is (at least approximately) normal, which 
validates the use of the one-sample t confidence interval to estimate the population
mean odometer reading, μ.

The sample mean and sample standard deviation are 66,221.1 and 37,683.1672,
respectively, and the (estimated) standard error of the mean is s/√n = 9420.7918.
Table A.5 shows that the t critical value for a confidence level of 95% when
df = 16 − 1 = 15 is t* = 2.131. The confidence interval is then


$$\bar{x} \pm t^* \cdot \frac{s}{\sqrt{n}} = 66{,}221.1 \pm (2.131)(9420.7918) = 66{,}221.1 \pm 20{,}075.7 = (46{,}145.4,\ 86{,}296.8)$$


That is, we can say with a confidence level of 95% that 46,145.4 < μ < 86,296.8.
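The arithmetic in this example can be reproduced with a short script. This is a sketch only: the function name `t_interval` is hypothetical, the data are the 16 odometer readings listed above, and the critical value 2.131 is typed in from Table A.5 rather than computed.

```python
import statistics
from math import sqrt

# Odometer readings (miles) as listed in Example 5.14.
odometer = [1445, 25822, 26892, 29860, 35285, 47874, 49544, 64763,
            72698, 75732, 84457, 91577, 93000, 109538, 113399, 137652]
T_STAR_15_DF = 2.131   # 95% t critical value for 15 df, from Table A.5

def t_interval(data, t_star):
    """One-sample t interval: xbar +/- t* * s / sqrt(n)."""
    n = len(data)
    xbar = statistics.fmean(data)
    std_error = statistics.stdev(data) / sqrt(n)   # estimated standard error
    return xbar - t_star * std_error, xbar + t_star * std_error

lower, upper = t_interval(odometer, T_STAR_15_DF)
print(f"({lower:.1f}, {upper:.1f})")   # prints (46145.4, 86296.8)
```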


Fig. 5.8 Normal probability plot of the Boxster odometer reading data

Note that it is not correct at this point to write P(46,145.4 < μ < 86,296.8) = .95,
because nothing inside the parentheses is random. The interval we have calculated
may or may not include the actual value of μ. If we were to obtain sample after sample
of size 16 from this population distribution and for each one use the given formula with
t* = 2.131, in the long run 95% of the calculated CIs would include μ whereas 5%
would not. Without knowing the value of μ, we can’t know whether the particular
interval we have calculated is one of the “good” 95% or the “bad” 5%.


5.3.2 A Large-Sample Confidence Interval for μ


When the sample size n is sufficiently large, the Central Limit Theorem says that X̄
has approximately a normal distribution even when the population distribution is
not normal. Furthermore, it can be shown in this case that the standardized variable
(X̄ − μ)/(S/√n) has approximately a standard normal distribution; using S in place
of σ in the denominator does not appreciably increase the variability of Z when n is
large. This in turn implies that for large n, a legitimate confidence interval for a
population mean μ is


$$\bar{x} \pm z^* \cdot \frac{s}{\sqrt{n}} \qquad (5.2)$$


where the z critical values for the most frequently employed confidence levels
appear in the bottom row of Appendix Table A.5 (or can be extracted from the
z table). For example, the z critical value for 95% confidence, the most common
level used in practice, is z* = 1.96.


Example 5.15 Magnetic resonance imaging is a commonly used noninvasive
technique for assessing the extent of cartilage damage. However, there is concern
that the MRI sizing of articular cartilage defects may not be accurate. The article
“Preoperative MRI Underestimates Articular Cartilage Defect Size Compared with
Findings at Arthroscopic Knee Surgery” (Amer. J. of Sports Med., 2013: 590-595)
reported on a study involving a sample of 92 cartilage defects. For each one, the size
of the lesion area was determined by an MRI analysis and also during arthroscopic
surgery. Each MRI value was then subtracted from the corresponding arthroscopic
value to obtain a difference value; this is commonly referred to as “paired difference”
data. The sample mean difference was calculated to be 1.04 cm², with a
sample standard deviation of 1.67. Let’s now calculate a confidence interval using a
confidence level of (at least approximately) 95% for μ, the mean difference for the
population of all such defects (as did the authors of the cited article). Using
z* = 1.96 and Expression (5.2), the CI is


$$1.04 \pm 1.96 \cdot \frac{1.67}{\sqrt{92}} = 1.04 \pm .34 = (.70,\ 1.38)$$


At the 95% confidence level, we conclude that .70 < μ < 1.38. Perhaps the most
important aspect of this interval is that 0 is not included; only certain positive values
of μ are plausible. It is this fact that led the investigators to conclude that MRIs
tend to underestimate defect size.
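As a check on this calculation, Expression (5.2) can be evaluated directly from the summary statistics. The sketch below is illustrative: the function name is hypothetical, and the z critical value is obtained from the standard normal inverse cdf in Python's statistics module rather than from Table A.5.

```python
from math import sqrt
from statistics import NormalDist

def large_sample_interval(xbar, s, n, conf=0.95):
    """Large-sample z interval: xbar +/- z* * s / sqrt(n)."""
    z_star = NormalDist().inv_cdf((1 + conf) / 2)   # about 1.96 when conf = .95
    half_width = z_star * s / sqrt(n)
    return xbar - half_width, xbar + half_width

# Summary values from Example 5.15.
lower, upper = large_sample_interval(xbar=1.04, s=1.67, n=92)
print(f"({lower:.2f}, {upper:.2f})")   # prints (0.70, 1.38)
```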


Many statisticians do not use Expression (5.2) unless their sample size is extremely
large, electing instead to use the one-sample t interval for virtually all cases. In
Example 5.15, for instance, z* = 1.96 would be replaced by the more conservative
t critical value at 91 df, which happens to be 1.986. This would make very little
practical change to the resulting interval. In the simulation sections of Chaps. 2-4,
where the “sample” size was typically 10,000 or more, there would be no controversy
in using x̄ ± 1.96s/√n as a 95% CI for the unknown mean μ of the rv being simulated.

When the sample size is small and the population distribution is substantially
non-normal, neither the one-sample t interval nor Expression (5.2) should be used.
In this case there are other techniques for obtaining a valid CI. One relatively 
recent, computationally intensive such method is called a bootstrap confidence 
interval. This entails obtaining a large number of samples of size n by resampling 
with replacement from the sample that was actually obtained—e.g., if the sample 
size is 20, a bootstrap might be based on 1000 samples of size 20, each obtained 
with replacement from the original sample. Details can be found in the book by 
Devore and Berk listed in the references. 
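To make the resampling idea concrete, here is a minimal sketch of one common variant, the percentile bootstrap interval for a mean. It is an assumption that this variant matches the method in the cited reference; the data values, the seed, and the function name are hypothetical and chosen only for illustration.

```python
import random
import statistics

def bootstrap_percentile_ci(data, conf=0.95, n_boot=1000, seed=3):
    """Percentile bootstrap CI for the mean, based on n_boot resamples."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample has the original size n and is drawn with replacement.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)
    )
    alpha = 1 - conf
    lo_index = int((alpha / 2) * n_boot)
    hi_index = int((1 - alpha / 2) * n_boot) - 1
    return means[lo_index], means[hi_index]

# Hypothetical right-skewed sample of size 20 (illustrative values only).
sample = [0.3, 0.5, 0.6, 0.8, 0.9, 1.1, 1.2, 1.4, 1.7, 1.9,
          2.2, 2.6, 3.1, 3.5, 4.2, 5.0, 6.3, 8.1, 11.4, 17.9]
lower, upper = bootstrap_percentile_ci(sample)
print(lower, upper)
```

The interval endpoints are simply the 2.5th and 97.5th percentiles of the 1000 resampled means, so no normality assumption enters the calculation.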


5.3.3 Software for Confidence Interval Calculation 


It should be no surprise that modern software can compute confidence intervals 
automatically once we have supplied the software with our data. In R, the t.test
function takes in a vector of data and returns, among other things, a one-sample
t 95% confidence interval for the population mean μ. The optional argument
conf.level can be used to select any other confidence level (the default is 
conf.level=.95). The analogous function in Matlab is ttest, although the 
inputs and outputs are managed differently. Both are illustrated in Fig. 5.9. 

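Notice that both R and Matlab give a CI of (46,147, 86,303) for the true mean
odometer reading. This is roughly what we computed in Example 5.14. The
disparity is due in part to rounding in the critical value t* and in part to the
second observation, which is entered as 25882 in Fig. 5.9 rather than the 25,822
listed in Example 5.14; this shifts the sample mean from 66,221.1 to 66,224.9.
The other information provided by R relates to hypothesis testing, which we will
discuss in Sect. 5.4.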

(a)
>> x=[1445,25882,26892,29860,35285,47874,49544,64763,72698,75732,84457,...
91577,93000,109538,113399,137652];
>> [~,~,CI]=ttest(x)

46147 86303

(b)
> x<-c(1445,25882,26892,29860,35285,47874,49544,64763,72698,75732,
84457,91577,93000,109538,113399,137652)
> t.test(x)

One Sample t-test
data: x
t = 7.0305, df = 15, p-value = 4.068e-06
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
46147.22 86302.53

Fig. 5.9 One-sample t intervals for μ using the data in Example 5.14: (a) Matlab; (b) R

To simply find the t critical value for a particular df, the inverse cdf commands
can be implemented, but with one proviso: for the central area of a t curve to equal
some confidence level C, the cumulative probability from −∞ to the critical value
must be


$$C + \frac{1 - C}{2} = \frac{1 + C}{2}$$


e.g., for 95% confidence, C = .95, and the cumulative probability is (1 + .95)/2 = .975.
In Matlab, the command icdf('t',.975,15) returns 2.1314, the t critical
value at 15 df that we used in Example 5.14. In R, qt(.975,15) gives this same
value.
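The same inverse-cdf recipe can be sketched without statistical software. The Python standard library has no built-in t inverse cdf, so the illustration below integrates the t density numerically (Simpson's rule) and solves for the point whose cumulative probability is (1 + C)/2 by bisection. All function names are hypothetical, and the approach is a teaching sketch rather than a substitute for icdf or qt; it will overflow for df in the hundreds because of the gamma function.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: Simpson's rule on [0, x], plus .5 by symmetry."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(conf, df):
    """Critical value capturing central area `conf` under the t curve."""
    target = (1 + conf) / 2            # cumulative probability (1 + C)/2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:      # widen until the target is bracketed
        hi *= 2
    for _ in range(60):                # bisection search
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(0.95, 15), 4))
print(round(t_critical(0.95, 24), 3))
```

The two printed values should agree with the 15 df and 24 df critical values used earlier in this section.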


To construct the interval (Eq. 5.2) for a population mean μ, use the command
ztest in Matlab; the z-based CI for μ is not implemented in the R base package.


5.3.4 Exercises: Section 5.3 (37-50) 


37. Determine the t critical value for a one-sample t confidence interval in each of
the following situations.

(a) Confidence level = 95%, df = 10
(b) Confidence level = 95%, df = 15
(c) Confidence level = 99%, df = 15
(d) Confidence level = 99%, n = 5
(e) Confidence level = 98%, df = 24
(f) Confidence level = 99%, n = 38


38. According to the article “Fatigue Testing of Condoms” (Polymer Testing,
2009: 567-571), “tests currently used for condoms are surrogates for the
challenges they face in use,” including a test for holes, an inflation test, a
package seal test, and tests of dimensions and lubricant quality (all fertile
territory for the use of statistical methodology!). The investigators developed
a new test that adds cyclic strain to a level well below breakage and
determines the number of cycles to break. A sample of 20 condoms of one
particular type resulted in a sample mean number of 1584 and a sample
standard deviation of 607. Calculate and interpret a confidence interval at
the 99% confidence level for the true average number of cycles to break.
[Note: The article presented the results of hypothesis tests based on the
t distribution; the validity of these depends on assuming normal population
distributions.]
39. Here is a sample of ACT scores (average of the Math, English, Social Science,
and Natural Science scores) for students taking college freshman calculus:


24.00 28.00 27.75 27.00 24.25 23.50 26.25
24.00 25.00 30.00 23.25 26.25 21.50 26.00
28.00 24.50 22.50 28.25 21.25 19.75


(a) Using an appropriate graph, see if it is plausible that the observations were
selected from a normal distribution.
(b) Calculate a 95% confidence interval for the population mean.
(c) The university ACT average for entering freshmen that year was about 21.
Are the calculus students better than average, as measured by the ACT?

40. Even as traditional markets for sweetgum lumber have declined, large section
solid timbers traditionally used for construction bridges and mats have
become increasingly scarce. The article “Development of Novel Industrial
Laminated Planks from Sweetgum Lumber” (J. of Bridge Engr., 2008: 64-66)
described the manufacturing and testing of composite beams designed to add
value to low-grade sweetgum lumber. Here is data on the modulus of rupture
(psi; the article contained summary data expressed in MPa):


6807.99 7637.06 6663.28 6165.03 6991.41 6992.23
6981.46 7569.75 7437.88 6872.39 7663.18 6032.28
6906.04 6617.17 6984.12 7093.71 7659.50 7378.61
7295.54 6702.76 7440.17 8053.26 8284.75 7347.95
7422.69 7886.87 6316.67 7713.65 7503.33 7674.99


(a) Verify the plausibility of assuming a normal population distribution.
(b) Estimate the true average modulus of rupture in a way that conveys
information about precision and reliability.

41. A sample of 26 offshore oil workers took part in a simulated escape exercise,
resulting in the accompanying data on time (seconds) to complete the escape
(“Oxygen Consumption and Ventilation During Escape from an Offshore
Platform,” Ergonomics, 1997: 281-292):


389 356 359 363 375 424 325 394 402
373 373 370 364 366 364 325 339 393
392 369 374 359 356 403 334 397


(a) Calculate a 99% confidence interval for the population mean escape time.
(b) Would a 90% CI based on the same data be wider or narrower? Explain.

42. A study of the ability of individuals to walk in a straight line (“Can We Really
Walk Straight?” Amer. J. Phys. Anthropol., 1992: 19-27) reported the
accompanying data on cadence (strides per second) for a sample of n = 20
randomly selected healthy men.


.93 .85 .92 .95 .93 .86 1.00 .92 .85 .81
.78 .93 .93 1.05 .93 1.06 1.06 .96 .81 .96


A normal probability plot gives substantial support to the assumption that
the population distribution of cadence is approximately normal. Calculate and
interpret a 95% confidence interval for population mean cadence.

43. The article “Measuring and Understanding the Aging of Kraft Insulating
Paper in Power Transformers” (IEEE Electrical Insul. Mag., 1996: 28-34)
contained the following observations on degree of polymerization for paper
specimens for which viscosity times concentration fell in a certain middle
range:


418 421 421 422 425 427 431
434 437 439 446 447 448 453
454 463 465


(a) Is it plausible that the given sample observations were selected from a
normal distribution?
(b) Calculate a 95% confidence interval for true average degree of polymerization
(as did the authors of the article). Does the interval suggest that
440 is a plausible value for true average degree of polymerization? What
about 450?

44. Silicone implant augmentation rhinoplasty is used to correct congenital nose
deformities. The success of the procedure depends on various biomechanical
properties of the human nasal periosteum and fascia. The article “Biomechanics
in Augmentation Rhinoplasty” (J. of Med. Engr. and Tech., 2005: 14-17)
reported that for a sample of 15 (newly deceased) adults, the mean failure
strain (%) was 25.0, and the standard deviation was 3.5. Assuming a normal
distribution for failure strain, estimate true average strain in a way that
conveys information about precision and reliability.

45. A more extensive tabulation of t critical values than what appears in this book
shows that for the t distribution with 20 df, the areas to the right of the values
.687, .860, and 1.064 are .25, .20, and .15, respectively. What is the confidence
level for each of the following three confidence intervals for the mean μ of a
normal population distribution? Which of the three intervals would you
recommend be used, and why?

(a) (x̄ − .687s/√21, x̄ + 1.725s/√21)
(b) (x̄ − .860s/√21, x̄ + 1.325s/√21)
(c) (x̄ − 1.064s/√21, x̄ + 1.064s/√21)

46. In many applications, it suffices to have a reliable lower bound for the mean μ,
because underestimating μ would be far more serious than overestimating
it. This gives rise to the idea of a lower confidence bound for μ: a quantity
L so that we can say with 95% confidence (for example) that L < μ.

(a) Let t* be a value such that


$$P\left(\frac{\bar{X} - \mu}{S/\sqrt{n}} < t^*\right) = .95$$


Manipulate the inequality inside the parentheses to isolate μ, and
conclude that L = x̄ − t*(s/√n) is a 95% lower confidence bound for μ.
(b) Notice that the expression given in (a) specifies that the area from −∞ to
t* under the appropriate t curve is .95; equivalently, the upper tail area
designated by t* is .05. What is the appropriate t critical value for a 95%
lower confidence bound with df = 10? df = 15? [Hint: Do not use the
header row in the t table as a reference; those confidence levels refer to
central areas, or equivalently two-sided confidence intervals.]
(c) A sample of 14 joint specimens of a particular type gave a sample mean
proportional limit stress of 8.48 MPa and a sample standard deviation of
.79 MPa (“Characterization of Bearing Strength Factors in Pegged Timber
Connections,” J. Struct. Engr., 1997: 326-332). Assuming the data are
drawn from a normally distributed population, calculate and interpret a
95% lower confidence bound for the true average proportional limit stress
of all such joints.

47. An upper confidence bound, U, for μ is obtained by replacing the − sign with a
+ sign in the expression for L from the previous exercise: U = x̄ + t*(s/√n).
As in the previous exercise, the t critical value is determined by a one-tail area,
not a central area.
Consider the following sample of fat content (in percentage) of n = 10
randomly selected hot dogs (“Sensory and Mechanical Assessment of the
Quality of Frankfurters,” J. Texture Stud., 1990: 395-409):


25.2 21.3 22.8 17.0 29.8 21.0 25.5 16.0 20.9 19.5


Assuming that these were selected from a normal population distribution,
calculate and interpret a 99% upper confidence bound for the population mean
fat content.

48. When the sample size n is very large, lower and upper confidence bounds for μ
can be obtained by replacing t* with z* in the expressions from the previous
two exercises. For example, a large-sample lower confidence bound for μ is
given by L = x̄ − z*s/√n, where z* satisfies the relation P(Z < z*) = c when
Z ~ N(0, 1) and 100c% is the prescribed confidence level (e.g., c = .95).

(a) Show that the z critical value for a one-sided (i.e., upper or lower) confidence
bound for μ, with confidence level 100c%, is given by z* = Φ⁻¹(c).
(b) Find the one-sided z critical values for 90, 95, and 99% confidence.
(c) A certain random variable is simulated 10,000 times. The sample mean
and standard deviation of the resulting 10,000 values are 41.63 and 8.05,
respectively. Calculate and interpret a 95% lower confidence bound for
the true expected value of this rv.

49. Often an investigator wishes to predict a single value of a variable to be
observed at some future time, rather than to estimate the mean value of that
variable. Suppose we will observe the values of a random sample X₁, …, Xₙ
from a normal population with mean μ and standard deviation σ, and from
these we wish to predict the value of a future independent observation Xₙ₊₁.

(a) Show that


$$Z = \frac{\bar{X} - X_{n+1}}{\sigma\sqrt{1 + \frac{1}{n}}}$$


has a standard normal distribution, where X̄ is the sample mean of X₁, …,
Xₙ. [Hint: Since the population is normal, the linear combination X̄ − Xₙ₊₁
is normal. Show that X̄ − Xₙ₊₁ has mean 0 and variance σ²(1 + 1/n), then
standardize.]
(b) If we replace σ with the sample standard deviation S in the expression for
Z from (a), it can be shown that the resulting quantity has a t distribution
with n − 1 df. Use this fact and a derivation similar to the one presented in
this section to show that a prediction interval (PI) for a single future
observation Xₙ₊₁ is given by


$$\bar{x} \pm t^* \cdot s\sqrt{1 + \frac{1}{n}}$$


(c) Use the previous expression, along with the data in Exercise 47, to provide
a 95% prediction interval for the fat content of a randomly selected hot
dog you will consume at some future time.

50. Independent observations X₁, …, Xₙ ~ N(μ₁, σ₁) and Y₁, …, Yₘ ~ N(μ₂, σ₂)
will be taken. For example the heights of n men and m women might be
recorded, where the height distribution of each gender is normally distributed
but with unknown parameters. Of interest is the difference between the two
unknown population means, μ₁ − μ₂.

(a) The logical estimator of μ₁ − μ₂ is X̄ − Ȳ, the difference of the sample
averages of the two samples. By determining the mean and variance of
X̄ − Ȳ, show that


$$Z = \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m}}}$$


has a standard normal distribution.

(b) Let z* be the z critical value such that P(−z* < Z < z*) = c, where Z is as
above and 100c% is the prescribed confidence level. Rewrite the
inequalities to provide a confidence interval for the difference of population
means μ₁ − μ₂.

(c) If n and m are large (say, > 40), replacing σ₁ and σ₂ with S₁ and S₂ (the
two sample standard deviations) under the square root in (a) adds little
extra variability; the resulting standardized variable still has approximately
a standard normal distribution. Make this substitution to obtain a
large-sample z confidence interval for μ₁ − μ₂ that can be implemented in
practice.

(d) The article “Gender Differences in Individuals with Comorbid Alcohol
Dependence and Post-Traumatic Stress Disorder” (Amer. J. Addiction,
2003: 412-423) reported the accompanying data in total score in the
Obsessive-Compulsive Drinking Scale.


Gender Sample size Sample mean Sample SD 
Male 44 IQ). 7.74 
Female 40 16.26 7.58 


Calculate and interpret a 95% confidence interval for the difference in the 
true mean scores for males and females. 


5.4 Testing Hypotheses About a Population Mean 


We have seen that a parameter can be estimated from sample data, either by a single 
number (a point estimate) or an entire interval of plausible values (a confidence 
interval). Frequently, however, the objective of an investigation is not to estimate a 
parameter but to decide which of two contradictory claims about the parameter is 
correct. Methods for accomplishing this comprise the part of statistical inference 
called hypothesis testing. 


5.4.1 Hypotheses and Test Procedures 


A statistical hypothesis, or just hypothesis, is a claim or assertion either about the
value of a single parameter (population characteristic or characteristic of a probability
distribution), about the values of several parameters, or about the form of an
entire probability distribution. Examples include

• The claim that μ = $350, where μ is the true average one-term textbook expenditure
for students at a university

• The assertion that p < .50, where p is the proportion of children who have a food
allergy of some sort

• The claim that μ₁ − μ₂ > 3, where μ₁ is the true average fuel efficiency (mpg) of
all current model year Honda Accords equipped with a 4-cylinder engine and
μ₂ is the analogous characteristic for Accords equipped with a 6-cylinder engine.

In any hypothesis-testing problem, there are two contradictory hypotheses under
consideration. One hypothesis might be the claim μ = $350 and the other μ ≠ $350,
or the two contradictory statements might be p ≥ .50 and p < .50. The objective is to
decide, based on sample information, which of the two hypotheses is correct. There 
is a familiar analogy to this in a criminal trial. One claim is the assertion that the 
accused individual is innocent. In the US judicial system, this is the claim that is 
initially believed to be true. Only in the face of strong evidence to the contrary 
should the jury reject this claim in favor of the alternative assertion that the accused 
is guilty. In this sense, the claim of innocence is the favored or protected hypothesis, 
and the burden of proof is placed on those who believe in the alternative claim. 

Similarly, in testing statistical hypotheses, the problem will be formulated so 
that one of the claims is initially favored. This initially favored claim will not be 
rejected in favor of the alternative claim unless sample evidence contradicts it and 
provides strong support for the alternative assertion. 


DEFINITION

The null hypothesis, denoted by H₀, is the claim that is initially assumed to
be true (the “prior belief” claim). The alternative hypothesis, denoted by Hₐ,
is the assertion that is contradictory to H₀.

The null hypothesis will be rejected in favor of the alternative hypothesis
only if sample evidence suggests that H₀ is false. If the sample does not
strongly contradict H₀, we will continue to believe in the plausibility of the
null hypothesis. The two possible conclusions from a hypothesis-testing
analysis are then reject H₀ or fail to reject H₀.


A test of hypotheses is a method for using sample data to decide whether the
null hypothesis should be rejected. Thus we might test H₀: μ = 350 against the
alternative Hₐ: μ ≠ 350. Only if sample data strongly suggests that μ is something
other than 350 should the null hypothesis be rejected. In the absence of such
evidence, H₀ should not be rejected since it is still judged to be plausible.

Sometimes an investigator does not want to accept a particular assertion unless
and until data can provide strong support for the assertion; in that situation, this
assertion will be the investigator’s alternative hypothesis Hₐ. As an example,
suppose a company is considering putting a new additive in the dried fruit that it
produces. The true average shelf life with the current additive is known to be
200 days. With μ denoting the true average shelf life with the new additive, the
company would not want to make a change unless evidence strongly suggested that
μ exceeds 200. An appropriate problem formulation would involve testing H₀:
μ = 200 against Hₐ: μ > 200. The conclusion that a change is justified is identified
with Hₐ, and it would take conclusive evidence to justify rejecting H₀ and switching
to the new additive.

Scientific research often involves trying to decide whether a current theory
should be replaced by a more plausible and satisfactory explanation of the phenomenon
under investigation. A conservative approach is to identify the current theory
with H₀ and the researcher’s alternative explanation with Hₐ. Rejection of the
current theory will then occur only when evidence is much more consistent with
the new theory. In many situations, Hₐ is referred to as the “research hypothesis,”
since it is the claim that the researcher would really like to validate. The word null
means “of no value, effect, or consequence,” which suggests that H₀ should be
identified with the hypothesis of no change (from current opinion), no difference,
no improvement, and so on. Suppose, for example, that 10% of all computer circuit
boards produced by a manufacturer during a recent period were defective. An
engineer has suggested a change in the production process in the belief that it
will result in a reduced defective rate. Let p denote the true proportion of defective
boards resulting from the changed process. Then the research hypothesis, on which
the burden of proof is placed, is the assertion that p < .10. Thus the alternative
hypothesis is Hₐ: p < .10.

In our treatment of hypothesis testing, H₀ will generally be stated as an equality
claim. When the parameter of interest is a population mean μ, the null hypothesis
will have the form H₀: μ = μ₀, where μ₀ is a specified number called the null value
(value claimed for μ by the null hypothesis). For example, let μ represent the true
average breaking strength of nylon string of a certain type. If a particular application
requires that μ exceed 100 N and the string will not be used unless there is
compelling evidence that this is the case, the natural alternative hypothesis is Hₐ:
μ > 100. It would then make sense to select as the null hypothesis the assertion that
μ ≤ 100. However, we will instead simplify the null hypothesis to H₀: μ = 100. The
rationale for using this simplified null hypothesis is that any reasonable decision
procedure for deciding between H₀: μ = 100 and Hₐ: μ > 100 will also be reasonable
for deciding between the claim that μ ≤ 100 and Hₐ, and should lead to exactly the
same conclusion for any particular sample. The use of a simplified H₀ is preferred
because it has certain technical benefits, which will be apparent shortly.

The alternative to the null hypothesis $H_0$: $\mu = \mu_0$ will look like one of the
following three assertions:

1. $H_a$: $\mu > \mu_0$ (in which case the implicit null hypothesis is $\mu \le \mu_0$)
2. $H_a$: $\mu < \mu_0$ (so the implicit null hypothesis states that $\mu \ge \mu_0$)
3. $H_a$: $\mu \ne \mu_0$


5.4.2 Test Procedures for Hypotheses About a Population Mean $\mu$


The decision as to whether $H_0$ should be rejected is based on the analysis of data $x_1$,
$x_2, \ldots, x_n$ resulting from a random sample of the population. A sensible strategy at
this point would be to calculate the sample mean $\bar{x}$ and reject the null hypothesis if
its value is too far from $\mu_0$ in the appropriate direction. For example, in the scenario
involving breaking strength of nylon string, a value of $\bar{x}$ considerably larger than
100 would suggest that $H_0$ is false and should be rejected. But an $\bar{x}$ value less than
100 would not incline us to reject $H_0$ in favor of $H_a$, since a sample mean less than
100 would certainly not convince us that the population mean $\mu$ is more than 100.

Rather than base a decision on $\bar{x}$ itself, let’s standardize $\bar{x}$ assuming that the null
hypothesis is true:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$


If we knew the value of the population standard deviation $\sigma$, we’d use it rather
than the sample standard deviation $s$, but in practice this is almost never the case.
Continuing with the nylon string scenario, $t = (\bar{x} - 100)/(s/\sqrt{n})$. For $n = 25$ and
sample data $\bar{x} = 108.5$, $s = 12.14$, we calculate $t = 8.5/2.428 = 3.50$. The
interpretation is that the value of the sample mean, 108.5, is 3.5 estimated standard errors
from what we’d expect it to be if the null hypothesis were true. In general, $t$ is the
distance between the sample mean and what we’d expect it to be if $H_0$ were true,
expressed in estimated standard errors (that is, estimated standard deviations of $\bar{X}$).
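The arithmetic above is easy to check by machine. The following is a minimal Python sketch of the standardization, not part of the text's R or Matlab tool set; the function name `t_statistic` is our own label, chosen only for illustration.

```python
import math

def t_statistic(xbar, mu0, s, n):
    """Distance of the sample mean from the null value, in estimated standard errors."""
    return (xbar - mu0) / (s / math.sqrt(n))

# Nylon string scenario: n = 25, sample mean 108.5, sample sd 12.14, null value 100
t_nylon = t_statistic(xbar=108.5, mu0=100, s=12.14, n=25)
print(round(t_nylon, 2))
```

The same function applies to any one-sample setting once the null value and the three summary quantities are supplied.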

Now let’s see if we can identify which values of $t$ are at least as contradictory to
$H_0$ as the value calculated from the available sample data. Again focusing on the
nylon string situation, because the alternative hypothesis states that the population
mean exceeds 100, any value of $\bar{x}$ greater than 108.5 argues even more strongly
against $H_0$ than does the 108.5 resulting from our sample. And any $\bar{x}$ greater than
108.5 corresponds to a value of $t$ that exceeds 3.50. So values of $t$ that are at least
3.50 are at least as contradictory to $H_0$ as 3.50 itself.

As another example, now suppose that $\mu$ represents the mean IQ for a large
population of children, and consider the rival hypotheses $H_0$: $\mu = 100$ and $H_a$:
$\mu \ne 100$. Because 100 is the generally accepted value of mean IQ in the USA, the
alternative hypothesis here states that the average for the designated population of
children is different from this accepted value. Suppose a sample of 225 children
gives a sample mean IQ of 98.6 and a sample standard deviation of 16.15, from
which $t = (98.6 - 100)/(16.15/\sqrt{225}) = -1.30$. The average IQ in the sample is
1.3 estimated standard errors smaller than what would be expected were the null
hypothesis true. To decide which values of $t$ are at least as contradictory to $H_0$ as
$-1.30$, first consider which values of $\bar{x}$ are at least as contradictory to $H_0$ as 98.6.
Not only is any value 98.6 or smaller in this category, but also any value that is at
least 101.4, that is, any value at least as far from 100 in either direction
(because $\ne$ appears in the alternative hypothesis). Thus any value of $t$ that is either
$\le -1.30$ or $\ge 1.30$ is at least as contradictory to $H_0$ as our calculated $t = -1.30$.


5.4.3 P-Values and the One-Sample t Test 


Before data have been obtained, the sample mean and sample standard deviation are
random variables, which we have previously denoted by $\bar{X}$ and $S$, respectively.
Substituting these for $\bar{x}$ and $s$ in the formula for $t$ gives what is called the test
statistic

$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}},$$

which is also a random variable (that is, its value is subject to uncertainty prior to
obtaining the sample data).

If the population distribution is normal, Gosset’s Theorem from Sect. 5.3 implies
that when the null hypothesis is true, $T$ has a $t$ distribution with $n - 1$ degrees of
freedom. In the case of the nylon string scenario, $T = (\bar{X} - 100)/(S/\sqrt{n})$ would
have a $t_{n-1}$ distribution (here $t_{24}$, since $n = 25$) when $H_0$: $\mu = 100$ is true
(assuming that the population strength distribution is normal). For the previously given
sample information, the calculated value of the test statistic was 3.50. Now consider the
probability, calculated assuming that the null hypothesis is true, of obtaining a test statistic
value at least as contradictory to the null hypothesis as the value 3.50 resulting from
our sample data:

$$
\begin{aligned}
P_{H_0}(T \ge 3.50) &= P(\text{a } t_{24} \text{ random variable is at least } 3.50) \\
&= \text{the area under the } t_{24} \text{ curve to the right of } 3.50 \\
&= .001 \quad \text{(from software)}
\end{aligned}
$$


That is, if the null hypothesis is true, there is only a .1% chance of obtaining a
sample at least as contradictory to the null hypothesis as our sample. So our sample
is among the .1% of all samples most contradictory to $H_0$.

Recall that $t = -1.30$ in the IQ scenario. Again assuming a normal population
distribution, the probability of getting a value of $T$ at least as contradictory to $H_0$
when $H_0$ is true is

$$
\begin{aligned}
P_{H_0}(T \le -1.30 \text{ or } T \ge 1.30) &= P(\text{a } t_{224} \text{ rv is} \le -1.30 \text{ or} \ge 1.30) \\
&= (\text{area under the } t_{224} \text{ curve to the left of } -1.30) \\
&\quad + (\text{area under the } t_{224} \text{ curve to the right of } 1.30) \\
&= 2(\text{area under the } t_{224} \text{ curve to the right of } 1.30) \\
&\approx 2(\text{area under the } z \text{ curve to the right of } 1.30) \\
&= .1936
\end{aligned}
$$


So when the null hypothesis is true, almost 20% of all samples would result in a
test statistic value that is at least as contradictory to $H_0$ as the one resulting from our
sample. This implies that our sample is not very contradictory to $H_0$.
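The two tail areas just quoted can be reproduced without a table. The sketch below is one way to do so in plain Python, assuming that Simpson's rule applied to the $t$ density is an acceptable stand-in for the software the text refers to; the helper names `t_pdf`, `t_upper_tail`, and `z_upper_tail` are hypothetical and do not come from any statistics package.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Area under the t curve to the right of t, via Simpson's rule on [0, |t|]."""
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3                 # area between 0 and |t|
    return 0.5 - central if t >= 0 else 0.5 + central

def z_upper_tail(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p_nylon = t_upper_tail(3.50, 24)            # upper-tailed, 24 df
p_iq_t = 2 * t_upper_tail(1.30, 224)        # two-tailed, 224 df
p_iq_z = 2 * z_upper_tail(1.30)             # normal approximation used in the text
print(round(p_nylon, 3), round(p_iq_t, 4), round(p_iq_z, 4))
```

With 224 df the $t$ and $z$ areas differ only in the third decimal place, which is why the text is comfortable substituting the $z$ curve in the IQ scenario.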


DEFINITION 

The P-value is the probability, calculated assuming that the null hypothesis is
true, of obtaining a value of the test statistic at least as contradictory to $H_0$ as
the value calculated from the available sample. The smaller the P-value, the
more the data contradict the null hypothesis, so $H_0$ should be rejected in
favor of $H_a$ if the P-value is sufficiently small.

More specifically, select a number $\alpha$ reasonably close to 0; then reject
the null hypothesis if P-value $\le \alpha$ and do not reject the null hypothesis if
P-value $> \alpha$. The selected $\alpha$ is called the significance level of the test.


The most frequently employed values of the significance level are $\alpha = .05$, $.01$,
and $.001$. We shall say more about the choice of $\alpha$ shortly.


ONE-SAMPLE T TEST 

Consider testing the null hypothesis $H_0$: $\mu = \mu_0$ based on a random sample $X_1$,
$X_2, \ldots, X_n$ from a normal population distribution (the plausibility of the
normality assumption should be checked by examining a normal probability
plot). The test statistic is

$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

The calculated value of this test statistic is $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$. The
determination of the P-value depends on the choice of $H_a$ as follows:

Alternative Hypothesis     P-value
$H_a$: $\mu > \mu_0$       Area under the $t_{n-1}$ curve to the right of $t$
$H_a$: $\mu < \mu_0$       Area under the $t_{n-1}$ curve to the left of $t$
$H_a$: $\mu \ne \mu_0$     2 · (Area under the $t_{n-1}$ curve to the right of $|t|$)


The test procedure when the alternative hypothesis is $H_a$: $\mu > \mu_0$ is referred to as
an upper-tailed test, because the P-value is the area captured in the upper tail of the
relevant $t$ curve (i.e., to the right of $t$). Analogously, the test procedure for the
second case is called a lower-tailed test, and the procedure in the third case is a
two-tailed test. Figure 5.10 illustrates the determination of the P-value in the three
different cases.

Appendix Table A.6 provides information about tail areas under various
$t$ curves. The calculated value of $t$ (to the accuracy of the tenths digit) appears
along the left margin, and there is a different column for each number of df. For
example, the entry at the intersection of the $t = 2.4$ row and the 15 df column is .015,
the area under the 15 df $t$ curve to the right of 2.4. By symmetry, this is also the area
under the 15 df $t$ curve to the left of $-2.4$. Various software packages will allow for
more decimal accuracy in $t$ and the corresponding areas.


Fig. 5.10 P-values for $t$ tests. In each panel the P-value is an area under the $t$ curve
for the relevant df: (1) upper-tailed test ($H_a$ contains the inequality $>$): area in the
upper tail, to the right of the calculated $t$; (2) lower-tailed test ($H_a$ contains the
inequality $<$): area in the lower tail, to the left of the calculated $t$; (3) two-tailed test
($H_a$ contains the inequality $\ne$): sum of the areas in the two tails, beyond the calculated
$t$ and $-t$


Example 5.16 Correct alignment of the tibial and femoral components is an
important factor in determining favorable long-term results of total knee
arthroplasty (TKA). It is generally accepted that the tibial component should be
placed perpendicular to the anatomical axis of the tibia. The article “Simple Method
for Confirming Tibial Osteotomy During Total Knee Arthroplasty” (Sports Medicine,
Arthroscopy, Rehabilitation, Therapy, and Technology, 2012, 4:44) reported
that for a sample of 35 TKAs, the sample mean varus angle of the tibial osteotomy
was 89.46° and the sample standard deviation was 1.62°. The authors of the cited
article carried out a one-sample $t$ test to see whether the true average angle differed
from 90° (presumably after examining a normal probability plot of the data). The
relevant hypotheses are $H_0$: $\mu = 90$ versus $H_a$: $\mu \ne 90$.
The calculated value of the test statistic is

$$t = \frac{89.46 - 90}{1.62/\sqrt{35}} = -1.97$$

The inequality $\ne$ in $H_a$ implies that the test is two-tailed, so the P-value is twice the
area under the $t_{34}$ curve to the right of 1.97. The entry in the 2.0 row and 35 df
column of Table A.6 is .027, so the P-value is approximately 2(.027) = .054 (the
article reports .055; notice that we have had to round both the test statistic and the df
in order to use the $t$ table).
Thus with a significance level of .05, the null hypothesis cannot be rejected
because P-value $= .054 > .05 = \alpha$. This is what allowed the investigators to
conclude that “there was no significant difference from the target angle of 90°.”
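The three cases in the one-sample $t$ test box translate directly into a small routine. The following is a hedged Python sketch of that procedure rather than the implementation used by R or Matlab: tail areas come from Simpson's rule on the $t$ density, and the names `t_pdf`, `t_upper_tail`, `one_sample_t_test`, and the `alternative` labels are our own assumptions. It is applied to the summary values appearing in the displayed calculation of Example 5.16, using the exact 34 df instead of rounding to the table's 35 df column.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Area under the t curve to the right of t, via Simpson's rule on [0, |t|]."""
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3
    return 0.5 - central if t >= 0 else 0.5 + central

def one_sample_t_test(xbar, s, n, mu0, alternative):
    """Return (t, P-value) for H0: mu = mu0 against the stated alternative."""
    t = (xbar - mu0) / (s / math.sqrt(n))
    df = n - 1
    if alternative == "greater":        # upper-tailed
        p = t_upper_tail(t, df)
    elif alternative == "less":         # lower-tailed (left area, by symmetry)
        p = t_upper_tail(-t, df)
    elif alternative == "two-sided":    # two-tailed
        p = 2 * t_upper_tail(abs(t), df)
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return t, p

t, p = one_sample_t_test(xbar=89.46, s=1.62, n=35, mu0=90, alternative="two-sided")
reject = p <= 0.05
print(round(t, 2), round(p, 3), reject)
```

Because no rounding of $t$ or df is needed, the computed P-value will not match the table-based .054 exactly, but it leads to the same decision at level .05.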


Recall from the previous section on confidence intervals that when the sample
size $n$ is large, the standardized variable $(\bar{X} - \mu)/(S/\sqrt{n})$ has approximately a
standard normal distribution even if the population distribution is not normal. The
implication here is that we can relabel our test statistic as $Z = (\bar{X} - \mu_0)/(S/\sqrt{n})$.
Then the prescription in the one-sample $t$ box for obtaining the P-value is modified
by replacing $t_{n-1}$ and $t$ by $z$. That is, the P-value for these large-sample tests is an
appropriate $z$ curve area.


Example 5.17 The recommended daily intake of calcium for adults ages 18-30 is
1000 mg/day. The article “Dietary and Total Calcium Intakes Are Associated with
Lower Percentage Total Body and Truncal Fat in Young, Healthy Adults” (J. of the
Amer. College of Nutr., 2011: 484-490) reported the following summary data for a
sample of 76 healthy Caucasian males from southwestern Ontario, Canada: $n = 76$,
$\bar{x} = 1093$, $s = 477$. Let’s carry out a test at significance level .01 to see whether the
population mean daily intake exceeds the recommended value. The relevant
hypotheses are $H_0$: $\mu = 1000$ versus $H_a$: $\mu > 1000$.
The calculated value of the test statistic is

$$z = \frac{1093 - 1000}{477/\sqrt{76}} = 1.70$$

The resulting P-value is the area under the standard normal curve to the right of
1.70 (the inequality in $H_a$ implies that the test is upper-tailed). From Table A.3, this
area is $1 - \Phi(1.70) = 1 - .9554 = .0446$. Because this P-value is larger than .01, $H_0$
cannot be rejected. There is not compelling evidence to conclude at significance
level .01 that the population mean daily intake exceeds the recommended value
(even though the sample mean does so). Note that the opposite conclusion would
result from using a significance level of .05. But the smaller $\alpha$ that we used requires
more persuasive evidence from the data before rejecting $H_0$. ■
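The large-sample calculation of Example 5.17 can be checked in a few lines. This is a minimal Python sketch, assuming only the standard library's `statistics.NormalDist` for the standard normal cdf $\Phi$; the variable names are ours.

```python
import math
from statistics import NormalDist

# Summary data from Example 5.17
n, xbar, s, mu0 = 76, 1093, 477, 1000

z = (xbar - mu0) / (s / math.sqrt(n))
p_value = 1 - NormalDist().cdf(z)          # upper-tailed test: area to the right of z
print(round(z, 2), round(p_value, 4))

# The decision depends on the significance level chosen in advance
for alpha in (0.01, 0.05):
    decision = "reject H0" if p_value <= alpha else "do not reject H0"
    print(alpha, decision)
```

The loop makes the closing remark of the example concrete: the same P-value leads to different decisions at levels .01 and .05.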


5.4.4 Errors in Hypothesis Testing and the Power of a Test 


When a jury is called upon to render a verdict in a criminal trial, there are two
possible erroneous conclusions to be considered: convicting an innocent person, or
letting a guilty person go free. Similarly, in statistical hypothesis testing there are
two potential errors whose consequences must be considered when reaching a
conclusion.

DEFINITION
A Type I error involves rejecting the null hypothesis $H_0$ when it is true.
A Type II error involves not rejecting $H_0$ when it is false.


Since in the US judicial system the null hypothesis (a priori belief) is that the 
accused is innocent, a Type I error is analogous to convicting an innocent person. It 
would be nice if test procedures could be developed that offered 100% protection 
against committing both a Type I error and a Type II error. This is an impossible 
goal, however, because a conclusion is based on sample data rather than a census of 
the entire population. There is always some chance that the sample will be unrep- 
resentative of the population and lead to an incorrect conclusion. The best we can 
hope for is a test procedure for which it is unlikely that either a Type I or a Type II 
error will be committed. 

Let’s reconsider the calcium intake scenario of the previous example. We
employed a significance level of $\alpha = .01$, and so we could reject $H_0$ only if
P-value $\le .01$. The P-value in the case of this upper-tailed large-sample test is the area
under the standard normal curve to the right of the calculated $z$. Table A.3 shows
that the $z$-value 2.33 captures an upper-tail area of .01 (look inside the table for a
cumulative area of .9900). The P-value (captured upper-tail area) will therefore be
at most .01 if and only if $z$ is at least 2.33; see Fig. 5.11.

Thus the probability of committing a Type I error (rejecting $H_0$ when it is
true) is the probability that the value of the test statistic $Z$ will be at least 2.33 when
$H_0$ is true. Now the key fact: because we created $Z$ by subtracting the null value
$\mu_0$ when standardizing, $Z$ has a standard normal distribution when $H_0$ is true. So

$$
\begin{aligned}
P(\text{Type I error}) &= P(\text{rejecting } H_0 \text{ when } H_0 \text{ is true}) \\
&= P(Z \ge 2.33 \text{ when } Z \text{ is a standard normal rv}) = .01 = \alpha
\end{aligned}
$$


This is true not only for the $z$ test of Example 5.17 but also for the $t$ tests
described earlier and, in fact, for any test procedure.


Fig. 5.11 P-values for an upper-tailed large-sample test: (a) P-value $< .01$ if $z > 2.33$;
(b) P-value $> .01$ if $z < 2.33$

PROPOSITION
The significance level $\alpha$ that is employed when $H_0$ is rejected iff P-value $\le \alpha$
is also the probability that the test results in a Type I error.


Thus a test with significance level .01 is one for which there is a 1% chance of 
committing a Type I error, whereas using a significance level of .05 results in a test 
with a Type I error probability of .05. The smaller the significance level, the less 
likely it is that the null hypothesis will be rejected when it is true. A smaller 
significance level makes it harder for the null hypothesis to be rejected and 
therefore less likely that a Type I error will be committed. 
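The link between the significance level and the Type I error probability can also be seen empirically. The sketch below is a small Monte Carlo experiment in Python under assumptions of our own: samples of size 76 are drawn from a normal population whose mean equals the null value 1000 and whose standard deviation is set to 477 (borrowed from the sample value in Example 5.17 purely as a stand-in), and $H_0$ is rejected whenever $z \ge 2.33$. The function name, seed, and number of replications are arbitrary choices.

```python
import math
import random

def type_i_error_rate(mu0=1000, sigma=477, n=76, z_crit=2.33, reps=4000, seed=2024):
    """Fraction of samples, generated with H0 true, for which Z >= z_crit."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(mu0, sigma) for _ in range(n)]
        xbar = sum(x) / n
        s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
        z = (xbar - mu0) / (s / math.sqrt(n))
        if z >= z_crit:
            rejections += 1
    return rejections / reps

rate = type_i_error_rate()
print(rate)
```

The observed rejection fraction should land near .01, with some departure expected both from simulation noise and from the fact that the standard normal distribution of $Z$ is only approximate when $s$ replaces $\sigma$.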

It is natural to ask at this point why a significance level of .05 should ever be
employed when a significance level of .01 can be used. More generally, why use a
test with a larger significance level (a larger probability of a Type I error) when a
smaller level is available? The answer lies in something that we have not yet
explicitly considered: the likelihood of committing a Type II error. Let’s denote
the probability of a Type II error by $\beta$. That is,

$$\beta = P(\text{not rejecting } H_0 \text{ when } H_a \text{ is true})$$


This notation is actually somewhat misleading: whereas for any particular test
there is a single value of $\alpha$ (a consequence of having $H_0$ be a statement of equality),
there are in fact many different values of $\beta$. The alternative hypothesis in the
calcium intake situation was $H_a$: $\mu > 1000$. So this would be true if $\mu$ were 1010
or 1050 or 1100 or in fact any value exceeding 1000. Nevertheless, for any
particular way of $H_0$ being false and $H_a$ true, it can be shown that $\alpha$ and $\beta$ are
inversely related: changing the test procedure by decreasing $\alpha$ in order to make the
chance of a Type I error smaller has the inevitable consequence of making a Type II
error more likely. Conversely, using a larger significance level will make it less
likely that the null hypothesis will fail to be rejected when, in fact, it is false.

Let $\mu'$ denote some particular value of $\mu$ for which $H_a$ is true. For example, for
the hypotheses $H_0$: $\mu = 90$ versus $H_a$: $\mu \ne 90$ from Example 5.16, we might be
interested in determining $\beta$ when the true angle is 91°. Then $\mu' = 91$ and we wish
to find $\beta(91)$. The value of $\beta$ depends on several factors:


• How far the alternative value of interest $\mu'$ is from $\mu_0$ [$\beta(\mu')$ decreases as $\mu'$
  moves further away from $\mu_0$]
• The sample size $n$ [$\beta(\mu')$ decreases as $n$, and therefore df, increases]
• The value of the population standard deviation $\sigma$ [the larger the value of $\sigma$, the
  more difficult it is for $H_0$ to be rejected, and so the larger is $\beta(\mu')$]
• The significance level $\alpha$ [making $\alpha$ smaller increases $\beta(\mu')$]
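The four dependencies just listed can be made tangible in a simpler setting than the $t$ test. The Python sketch below assumes an upper-tailed $z$ test with known $\sigma$ (a simplification of ours, not the one-sample $t$ test of this section), for which $\beta(\mu') = \Phi\big(z_\alpha - (\mu' - \mu_0)/(\sigma/\sqrt{n})\big)$. The numerical inputs are illustrative values, not data from the text, and `beta_upper_tailed_z` is a hypothetical name.

```python
import math
from statistics import NormalDist

def beta_upper_tailed_z(mu_alt, mu0, sigma, n, alpha):
    """Type II error probability at mu_alt for an upper-tailed z test with known sigma."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)          # reject H0 when Z >= z_alpha
    shift = (mu_alt - mu0) / (sigma / math.sqrt(n))    # mean of Z when mu = mu_alt
    return NormalDist().cdf(z_alpha - shift)

# Illustrative baseline: mu0 = 100, mu' = 105, sigma = 12, n = 25, alpha = .05
base = beta_upper_tailed_z(105, 100, 12, 25, 0.05)
print("baseline          ", round(base, 3))
print("mu' farther away  ", round(beta_upper_tailed_z(110, 100, 12, 25, 0.05), 3))
print("larger n          ", round(beta_upper_tailed_z(105, 100, 12, 50, 0.05), 3))
print("larger sigma      ", round(beta_upper_tailed_z(105, 100, 20, 25, 0.05), 3))
print("smaller alpha     ", round(beta_upper_tailed_z(105, 100, 12, 25, 0.01), 3))
```

Each line changes one factor relative to the baseline, so the direction of the change in $\beta$ can be compared with the corresponding bullet.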


Calculating $\beta$ for the one-sample $t$ test by hand is quite difficult. This is because
when $\mu = \mu'$ rather than the null value $\mu_0$, the density function that describes the
distribution of the test statistic $T$ is exceedingly complicated. Fortunately statistical
software comes to our rescue. Rather than work directly with $\beta$, the most commonly
used software packages involve a quantity called power.

DEFINITION

Suppose the null and alternative hypotheses are assertions about the value of
some parameter $\theta$, with the null hypothesis having the form $H_0$: $\theta = \theta_0$ and the
alternative hypothesis obtained by replacing $=$ in $H_0$ by one of the three
inequalities $>$, $<$, or $\ne$. Let $\theta'$ denote some particular value of $\theta$ for which $H_a$
is true. Then the power at the value $\theta'$ for a test of these hypotheses is the
probability of rejecting $H_0$ when $\theta = \theta'$, which is $1 - \beta(\theta')$. The power of the
test when $\theta = \theta_0$ is also the probability that $H_0$ is rejected, which in this case is
the significance level $\alpha$.


Thus we want the power to be close to 0 when the null hypothesis is true and 
close to 1 when the null hypothesis is false. A “powerful” test is one that has high 
power for alternative values of the parameter, and thus good ability to detect 
departures from the null hypothesis. 


Example 5.18 The true average voltage drop from collector to emitter of insulated
gate bipolar transistors of a certain type is supposed to be at most 2.5 V. An
investigator selects a sample of $n = 10$ such transistors and uses the resulting
voltages as a basis for testing $H_0$: $\mu = 2.5$ versus $H_a$: $\mu > 2.5$ using a $t$ test with
significance level $\alpha = .05$. If the standard deviation of the voltage distribution is
$\sigma = .1$ V, how likely is it that $H_0$ will not be rejected when in fact $\mu = 2.55$ or when
$\mu = 2.6$? And what happens to the power and $\beta$ if the sample size is increased to 20?
The sampsizepwr function in Matlab provides the following information:

$\mu'$     n     Power
2.55     10    .4273
2.55     20    .6951
2.6      10    .8975
2.6      20    .9961

So in the case $\mu' = 2.55$, $\beta$ is roughly .57 when the sample size is 10 and roughly
.30 when the sample size is 20. Clearly these Type II error probabilities are rather
large. If it is important to detect such a departure from $H_0$, the test does not have
good power to do so. Software can also be used to determine what value of the
sample size $n$ is necessary to produce a sufficiently large power and correspondingly
small $\beta$. For example, when $\mu' = 2.55$, a sample size of $n = 36$ is required to
produce a power of .90. ■
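When a power routine is not at hand, the power values in Example 5.18 can be approximated by simulation. The following Python sketch is a Monte Carlo stand-in for the Matlab calculation, not a reproduction of it: it assumes a normal voltage distribution, takes the upper-tail critical values $t_{.05,9} = 1.833$ and $t_{.05,19} = 1.729$ from a standard $t$ table, and uses a function name, seed, and replication count of our own choosing.

```python
import math
import random

def simulated_power(mu_true, mu0, sigma, n, t_crit, reps=10000, seed=1):
    """Estimate P(reject H0) for an upper-tailed t test when the true mean is mu_true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(mu_true, sigma) for _ in range(n)]
        xbar = sum(x) / n
        s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
        t = (xbar - mu0) / (s / math.sqrt(n))
        if t >= t_crit:                 # equivalent to P-value <= alpha
            rejections += 1
    return rejections / reps

# Tabled critical values for alpha = .05 with n - 1 df (assumed inputs)
T_CRIT = {10: 1.833, 20: 1.729}

power_10 = simulated_power(2.55, 2.5, 0.1, 10, T_CRIT[10])
power_20 = simulated_power(2.55, 2.5, 0.1, 20, T_CRIT[20])
print(power_10, power_20)
```

The estimates should fall close to the tabled .4273 and .6951, differing only by simulation error; subtracting either from 1 gives the corresponding estimate of $\beta$.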


As Example 5.18 illustrates, the power of a test can be disappointingly small for
an alternative value of the parameter that represents an important departure from
the null hypothesis. Too often investigators are content to specify a comfortingly
small value of $\alpha$ without paying attention to power and $\beta$. This can easily result in a
test which has poor ability to detect when $H_0$ is false. Given the availability and
capabilities of statistical software packages, such a sin is unpardonable!

5.4.5 Software for Hypothesis Test Calculation


The t.test and ttest functions in R and Matlab, respectively, mentioned at the
end of Sect. 5.3 can be used to automatically perform the one-sample $t$ test
described in this section (in fact, that is the primary purpose of these functions).


Example 5.19 The accompanying data on cube compressive strength (MPa) of
concrete specimens appeared in the article “Experimental Study of Recycled
Rubber-Filled High-Strength Concrete” (Magazine of Concrete Res., 2009:
549-556):


112.3 97.0 97.7 86.0 102.0
99.2 95.8 103.5 89.0 86.7


Suppose the concrete will be used for a particular application unless there is
strong evidence that the true average strength is less than 100 MPa. Should the
concrete be used? Test at the .05 significance level.

Let $\mu$ denote the true average cube compressive strength of this concrete. We
wish to test the hypotheses $H_0$: $\mu = 100$ versus $H_a$: $\mu < 100$. A probability plot
indicates the data are consistent with a normally distributed population. Figure 5.12
shows the hypothesis test implemented in R and Matlab.

Both R and Matlab give a one-sided P-value of .1315 at 9 df. Since $.1315 > .05$,
at the .05 significance level we fail to reject $H_0$: there is insufficient evidence to
conclude the true mean strength of this concrete is less than 100 MPa. As a
consequence, the concrete should be used.

(a)
    > x <- c(112.3, 97.0, 97.7, 86.0, 102.0, 99.2, 95.8, 103.5, 89.0, 86.7)
    > t.test(x, mu=100, alternative="less")

            One Sample t-test
    data:  x
    t = -1.1937, df = 9, p-value = 0.1315
    alternative hypothesis: true mean is less than 100
    95 percent confidence interval:
         -Inf 101.6497
    sample estimates:
    mean of x
        96.92

(b)
    >> x=[112.3,97.0,97.7,86.0,102.0,99.2,95.8,103.5,89.0,86.7];
    >> [H,P]=ttest(x,100,.05,'left')
    H =
         0
    P =
        0.1315

Fig. 5.12 Performing the hypothesis test of Example 5.19: (a) R; (b) Matlab

In Fig. 5.12a, R gives the computed value of the test statistic, $t = -1.1937$, as
well as the sample mean, $\bar{x} = 96.92$ MPa, and a (one-sided) CI for $\mu$ of
$(-\infty, 101.6497)$. (See Exercises 46-47 from the previous section for information on such
bounds.) In Matlab, the significance level of .05 is a required input; the 'left'
argument instructs Matlab to perform a lower-tailed test. As seen in Fig. 5.12b,
Matlab then returns two items: the P-value, and also a logical bit denoted H
indicating whether to reject $H_0$. (The H = 0 output tells the user not to reject $H_0$ at
the specified $\alpha$ level.) ■
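For readers working outside R and Matlab, the same test can be carried out from first principles. The Python sketch below is a cross-check under stated assumptions, not a replacement for t.test or ttest: it uses the ten observations as entered in Fig. 5.12, computes the summary statistics with the standard `statistics` module, and obtains the lower-tail area by Simpson's rule through the hypothetical helpers `t_pdf` and `t_upper_tail`.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Area under the t curve to the right of t, via Simpson's rule on [0, |t|]."""
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3
    return 0.5 - central if t >= 0 else 0.5 + central

# Cube compressive strengths (MPa) as entered in Fig. 5.12
x = [112.3, 97.0, 97.7, 86.0, 102.0, 99.2, 95.8, 103.5, 89.0, 86.7]
mu0, alpha = 100, 0.05

n = len(x)
xbar = statistics.mean(x)
s = statistics.stdev(x)                      # sample standard deviation (n - 1 divisor)
t = (xbar - mu0) / (s / math.sqrt(n))
p_value = t_upper_tail(-t, n - 1)            # lower-tailed: area to the left of t
reject = p_value <= alpha
print(round(xbar, 2), round(t, 4), round(p_value, 4), reject)
```

Agreement with the software output in the figure is a useful sanity check that the data were entered correctly.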


Calculations of power, as well as sample size required to achieve a prescribed
power level, are available through the sampsizepwr function in Matlab and
the pwr package in R. The former is part of the Matlab Statistics Toolbox; the latter
is not part of the R base package and must be downloaded and installed.


5.4.6 Exercises: Section 5.4 (51-76) 


51. For each of the following assertions, state whether it is a legitimate statistical 
hypothesis and why: 
(a) H: $\sigma > 100$
(b) H: P = .45 
(c) H: $s < .20$
(d) H: $\sigma_1/\sigma_2 < 1$
(e) H: $\bar{X} - \bar{Y} = 5$
(f) H: $\lambda < .01$, where $\lambda$ is the parameter of an exponential distribution used to
model component lifetime

52. For the following pairs of assertions, indicate which do not comply with our
rules for setting up hypotheses and why (the subscripts 1 and 2 differentiate
between quantities for two different populations or samples):
(a) $H_0$: $\mu = 100$, $H_a$: $\mu > 100$
(b) $H_0$: $\sigma = 20$, $H_a$: $\sigma < 20$
(c) $H_0$: $p \ne .25$, $H_a$: $p = .25$
(d) $H_0$: $\mu_1 - \mu_2 = 25$, $H_a$: $\mu_1 - \mu_2 > 100$
(e) $H_0$: $S_1^2 = S_2^2$, $H_a$: $S_1^2 \ne S_2^2$
(f) $H_0$: $\mu = 120$, $H_a$: $\mu = 150$
(g) $H_0$: $\sigma_1/\sigma_2 = 1$, $H_a$: $\sigma_1/\sigma_2 \ne 1$
(h) $H_0$: $p_1 - p_2 = -.1$, $H_a$: $p_1 - p_2 < -.1$

53. To determine whether the girder welds in a new performing arts center meet
specifications, a random sample of welds is selected, and tests are conducted
on each weld in the sample. Weld strength is measured as the force required to
break the weld. Suppose the specifications state that mean strength of welds
should exceed 100 lb/in²; the inspection team decides to test $H_0$: $\mu = 100$
versus $H_a$: $\mu > 100$. Explain why it might be preferable to use this $H_a$ rather
than $\mu < 100$.

54. Let $\mu$ denote the true average radioactivity level (picocuries per liter). The
value 5 pCi/L is considered the dividing line between safe and unsafe water.
Would you recommend testing $H_0$: $\mu = 5$ versus $H_a$: $\mu > 5$ or $H_0$: $\mu = 5$ versus
$H_a$: $\mu < 5$? Explain your reasoning. [Hint: Think about the consequences of a
Type I and Type II error for each possibility.]

55. For which of the given P-values would the null hypothesis be rejected when
performing a level .05 test?

(a) .001 

(b) .021 

(c) .078 

(d) .047 

(e) .148 

56. Pairs of P-values and significance levels, $\alpha$, are given. For each pair, state
whether the observed P-value would lead to rejection of $H_0$ at the given
significance level.

(a) P-value = .084, $\alpha = .05$

(b) P-value = .084, $\alpha = .10$

(c) P-value = .003, $\alpha = .01$

(d) P-value = .039, $\alpha = .01$

57. Give as much information as you can about the P-value of a $t$ test in each of
the following situations:

(a) Upper-tailed test, df = 8, $t = 2.0$

(b) Lower-tailed test, df = 11, $t = -2.4$

(c) Two-tailed test, df = 15, $t = -1.6$

(d) Upper-tailed test, df = 19, $t = -.4$

(e) Upper-tailed test, df = 5, $t = 5.0$

(f) Two-tailed test, df = 40, $t = -4.8$

58. The paint used to make lines on roads must reflect enough light to be clearly
visible at night. Let $\mu$ denote the true average reflectometer reading for a new
type of paint under consideration. A test of $H_0$: $\mu = 20$ versus $H_a$: $\mu > 20$ will
be based on a random sample of size $n$ from a normal population distribution.
What conclusion is appropriate in each of the following situations?

(a) $n = 15$, test statistic value = 3.2, $\alpha = .05$

(b) $n = 9$, test statistic value = 1.8, $\alpha = .01$

(c) $n = 24$, test statistic value = $-.2$

59. Let $\mu$ denote the mean reaction time to a certain stimulus. For a large-sample
$z$ test of $H_0$: $\mu = 5$ versus $H_a$: $\mu > 5$, find the P-value associated with each of
the given values of the $z$ test statistic.

(a) 1.42

(b) .90

(c) 1.96

(d) 2.48

(e) $-.11$

60. Newly purchased tires of a certain type are supposed to be filled to a pressure
of 35 lb/in². Let $\mu$ denote the true average pressure. Find the P-value
associated with each given $z$ statistic value for testing $H_0$: $\mu = 35$ versus the
alternative $H_a$: $\mu \ne 35$.

(a) 2.10

(b) $-1.75$

(c) $-.55$

(d) 1.41

(e) $-5.3$


61. A pen has been designed so that true average writing lifetime under controlled
conditions (involving the use of a writing machine) is at least 10 h. A random
sample of 18 pens is selected, the writing lifetime of each is determined, and a
normal probability plot of the resulting data supports the use of a one-sample
$t$ test.
(a) What hypotheses should be tested if the investigators believe a priori that
the design specification has been satisfied?
(b) What conclusion is appropriate if the hypotheses of part (a) are tested,
$t = -2.3$, and $\alpha = .05$?
(c) What conclusion is appropriate if the hypotheses of part (a) are tested,
$t = -1.8$, and $\alpha = .01$?
(d) What should be concluded if the hypotheses of part (a) are tested and
$t = -3.6$?
62. Lightbulbs of a certain type are advertised as having an average lifetime of
750 h. The price of these bulbs is very favorable, so a potential customer has 
decided to go ahead with a purchase arrangement unless it can be conclusively 
demonstrated that the true average lifetime is smaller than what is advertised. 
A random sample of 50 bulbs was selected and the lifetime of each bulb 
determined. These 50 light bulbs had a sample mean lifetime of 738.44 h with 
a sample standard deviation of 38.20 h. What conclusion would be appropriate 
for a significance level of .05? 
63. Automatic identification of the boundaries of significant structures within a
medical image is an area of ongoing research. The paper “Automatic Segmen- 
tation of Medical Images Using Image Registration: Diagnostic and Simulation 
Applications” (J. of Medical Engr. and Tech., 2005: 53-63) discussed a new 
technique for such identification. A measure of the accuracy of the automatic 
region is the average linear displacement (ALD). The paper gave the following 
ALD observations for a sample of 49 kidneys (units of pixel dimensions). 


1.38 0.44 1.09 0.75 0.66 1.28 0.51 
0.39 0.70 0.46 0.54 0.83 0.58 0.64 
1.30 0.57 0.43 0.62 1.00 1.05 0.82 
1.10 0.65 0.99 0.56 0.56 0.64 0.45 
0.82 1.06 0.41 0.58 0.66 0.54 0.83 
0.59 0.51 1.04 0.85 0.45 0.52 0.58 
1.11 0.34 1.25 0.38 1.44 1.28 0.51 


(a) Is it plausible that ALD is at least approximately normally distributed?
Must normality be assumed prior to testing hypotheses about true average
ALD? Explain.
(b) The authors commented that in most cases the ALD is better than or of the
order of 1.0. Does the data in fact provide strong evidence for concluding
that true average ALD under these circumstances is less than 1.0? Carry
out an appropriate test of hypotheses.

64. A dynamic cone penetrometer (DCP) is used for measuring material resistance
to penetration (mm/blow) as a cone is driven into pavement or subgrade.
Suppose that for a particular application it is required that the true average
DCP value for a certain type of pavement be less than 30. The pavement will
not be used unless there is conclusive evidence that the specification has been
met. Test the appropriate hypotheses using the following data (“Probabilistic
Model for the Analysis of Dynamic Cone Penetrometer Test Values in
Pavement Structure Evaluation,” J. of Testing and Evaluation, 1999: 7-14):


14.1 14.5 15.5 16.0 16.0 16.7 16.9 17.1 17.5 17.8
17.8 18.1 18.2 18.3 18.3 19.0 19.2 19.4 20.0 20.0
20.8 20.8 21.0 21.5 23.5 27.5 27.5 28.0 28.3 30.0
30.0 31.6 31.7 31.7 32.5 33.5 33.9 35.0 35.0 35.0
36.7 40.0 40.0 41.3 41.7 47.5 50.0 51.0 51.8 54.4
55.0 57.0


65. The article “Uncertainty Estimation in Railway Track Life-Cycle Cost” (J. of
Rail and Rapid Transit, 2009) presented the following data on time to repair
(min) a rail break in the high rail on a curved track of a certain railway line.


159 120 480 149 270 547 340 43 228 202 240 218


A normal probability plot of the data shows a reasonably linear pattern, so it 
is plausible that the population distribution of repair time is at least approxi- 
mately normal. The sample mean and standard deviation are 249.7 and 145.1, 
respectively. Is there compelling evidence for concluding that true average 
repair time exceeds 200 min? Carry out a test of hypotheses using a signifi- 
cance level of .05. 

66. Have you ever been frustrated because you could not get a container of some
sort to release the last bit of its contents? The article “Shake, Rattle, and
Squeeze: How Much Is Left in That Container?” (Consumer Reports, May
2009: 8) reported on an investigation of this issue for various consumer
products. Suppose five 6.0 oz tubes of toothpaste of a particular brand are
randomly selected and squeezed until no more toothpaste will come out. Then
each tube is cut open and the amount remaining is weighed, resulting in the
following data (consistent with what the cited article reported): .53, .65, .46,
.50, .37. Does it appear that the true average amount left is less than 10% of
the advertised net contents?

(a) Check the validity of any assumptions necessary for testing the appropriate
hypotheses.




(b) Carry out a test of the appropriate hypotheses using a significance level of 
.05. Would your conclusion change if a significance level of .01 had been 
used? 

(c) Describe in context Type I and II errors, and say which error might have 
been made in reaching a conclusion. 

67. A random sample of soil specimens was obtained, and the amount of organic
matter (%) in the soil was determined for each specimen, resulting in the
accompanying data (from “Engineering Properties of Soil,” Soil Science,
1998: 93-102).


1.10 5.09 0.97 1.59 4.60 0.32 0.55 1.45 
0.14 4.47 1.20 3.50 5.02 4.67 5.22 2.69 
3.98 3.17 3.03 2.21 0.69 4.47 3.31 1.17 
0.76 1.17 1.57 2.62 1.66 2.05 


The values of the sample mean and standard deviation are 2.481 and 1.616, 
respectively. Does this data suggest that the true average percentage of 
organic matter in such soil is something other than 3%? Carry out a test of 
the appropriate hypotheses at significance level .10. Would your conclusion 
be different if $\alpha = .05$ had been used? [Note: A normal probability plot of the
data shows an acceptable pattern in light of the reasonably large sample size.]

68. Glycerol is a major by-product of ethanol fermentation in wine production and
contributes to the sweetness, body, and fullness of wines. The article “A Rapid 
and Simple Method for Simultaneous Determination of Glycerol, Fructose, 
and Glucose in Wine” (American J. of Enology and Viticulture, 2007: 279- 
283) includes the following observations on glycerol concentration (mg/ml) 
for samples of standard-quality (uncertified) white wines: 2.67, 4.62, 4.14, 
3.81, 3.83. Suppose the desired concentration value is 4. Does the sample data 
suggest that true average concentration is something other than the desired 
value? Carry out a test of appropriate hypotheses using the one-sample t test
with a significance level of .05. 

69. Exercise 41 gave n = 26 observations on escape time (seconds) for oil workers
in a simulated exercise, from which the sample mean and sample standard 
deviation are 370.69 and 24.36, respectively. Suppose the investigators had 
believed a priori that true average escape time would be at most 6 min. Does 
the data contradict this prior belief? Assuming normality, test the appropriate 
hypotheses using a significance level of .05. 

70. Minor surgery on horses under field conditions requires a reliable short-term
anesthetic producing good muscle relaxation, minimal cardiovascular and
respiratory changes, and a quick, smooth recovery with minimal aftereffects
so that horses can be left unattended. The article “A Field Trial of Ketamine
Anesthesia in the Horse” (Equine Vet. J., 1984: 176-179) reports that for a
sample of n = 73 horses to which ketamine was administered under certain
conditions, the sample average lateral recumbency (lying-down) time was
18.86 min and the standard deviation was 8.6 min. Does this data suggest that
true average lateral recumbency time under these conditions is less than
20 min? Test the appropriate hypotheses at level of significance .10.

71. The recommended daily dietary allowance for zinc among males older than
age 50 years is 15 mg/day. The article “Nutrient Intakes and Dietary Patterns
of Older Americans: A National Study” (J. Gerontol., 1992: M145-150)
reports the following summary data on intake for a sample of males age 65-74
years: n = 115, $\bar{x} = 11.3$, and s = 6.43. Does this data indicate that average
daily zinc intake in the population of all males age 65-74 falls below the
recommended allowance?

72. The industry standard for the amount of alcohol poured into many types of
drinks (e.g., gin for a gin and tonic, whiskey on the rocks) is 1.5 oz. Each 
individual in a sample of 8 bartenders with at least 5 years of experience was 
asked to pour rum for a rum and coke into a short, wide (tumbler) glass, 
resulting in the following data: 


2.00 1.78 2.16 1.91 1.70 1.67 1.83 1.48 


(Summary quantities agree with those given in the article “Bottoms Up! 
The Influence of Elongation on Pouring and Consumption Volume,” 
J. Consumer Res., 2003: 455-463.) 

(a) Carry out a test of hypotheses to decide whether there is strong evidence 
for concluding that the true average amount poured differs from the 
industry standard. 

(b) Does the validity of the test you carried out in (a) depend on any 
assumptions about the population distribution? If so, check the plausibil- 
ity of such assumptions. 

73. Before agreeing to purchase a large order of polyethylene sheaths for a
particular type of high-pressure oil-filled submarine power cable, a company
wants to see conclusive evidence that the true standard deviation of sheath
thickness is less than .05 mm. What hypotheses should be tested, and why? In
this context, what are the Type I and Type II errors?

74. Many older homes have electrical systems that use fuses rather than circuit
breakers. A manufacturer of 40-amp fuses wants to make sure that the mean
amperage at which its fuses burn out is in fact 40. If the mean amperage is
lower than 40, customers will complain because the fuses require replacement
too often. If the mean amperage is higher than 40, the manufacturer might be
liable for damage to an electrical system due to fuse malfunction. To verify
the amperage of the fuses, a sample of fuses is to be selected and inspected. If
a hypothesis test were to be performed on the resulting data, what null and
alternative hypotheses would be of interest to the manufacturer? Describe
Type I and Type II errors in the context of this problem situation.

75. Water samples are taken from water used for cooling as it is being discharged
from a power plant into a river. It has been determined that as long as the mean
temperature of the discharged water is at most 150 °F, there will be no
negative effects on the river’s ecosystem. To investigate whether the plant
is in compliance with regulations that prohibit a mean discharge-water temperature
above 150°, 50 water samples will be taken at randomly selected
times, and the temperature of each sample recorded. The resulting data will be
used to test the hypotheses $H_0\colon \mu = 150^\circ$ versus $H_a\colon \mu > 150^\circ$. In the context of
this situation, describe Type I and Type II errors. Which type of error would
you consider more serious? Explain.

76. A regular type of laminate is currently being used by a manufacturer of circuit 
boards. A special laminate has been developed to reduce warpage. The regular 
laminate will be used on one sample of specimens and the special laminate on 
another sample, and the amount of warpage will then be determined for each 
specimen. The manufacturer will then switch to the special laminate only if it 
can be demonstrated that the true average amount of warpage for that laminate 
is less than for the regular laminate. State the relevant hypotheses, and 
describe the Type I and Type II errors in the context of this situation. 


5.5 Inferences for a Population Proportion 


The previous two sections illustrated the methods of confidence intervals and 
hypothesis testing for an unknown mean, $\mu$. In this section, we will apply those
same ideas to drawing inferences about an unknown probability or population
proportion.

Let p denote the proportion of “successes” in a population, where success
identifies an individual or object that has a specified property. Equivalently, p is
the probability that a randomly selected individual or object is a success. A random
sample of n individuals is to be selected, and X denotes the number of successes in
the sample. The natural estimator of p is $\hat{P} = X/n$, the sample fraction of successes.
As derived in Sect. 2.4 and discussed earlier in this chapter, $E(\hat{P}) = p$ (unbiasedness)
and $SD(\hat{P}) = \sqrt{p(1-p)/n}$; moreover, provided $np \geq 10$ and $n(1-p) \geq 10$,
$\hat{P}$ has approximately a normal distribution.


5.5.1 Confidence Intervals for p


Since $\hat{P}$ is approximately normal, standardizing $\hat{P}$ by subtracting p and dividing by
$\sigma_{\hat{P}} = \sqrt{p(1-p)/n}$ implies that, for example,


$$P\left(-1.96 < \frac{\hat{P} - p}{\sqrt{p(1-p)/n}} < 1.96\right) \approx .95$$


A confidence interval for p results from replacing each < by = and solving the 
resulting quadratic equation for p. After some tedious algebra, this gives the two 
roots 

$$p = \frac{\left(\hat{P} + 1.96^2/2n\right) \pm 1.96\sqrt{\hat{P}(1-\hat{P})/n + 1.96^2/4n^2}}{1 + 1.96^2/n}$$


These form the endpoints of an approximate 95% CI for p. The more general 
formula is given in the following proposition. 


ONE-PROPORTION Z INTERVAL
Let $\hat{p}$ be the fraction of successes in a random sample of size n. Then a
confidence interval for the true/population proportion p has endpoints


$$\tilde{p} \pm z^* \, \frac{\sqrt{\hat{p}(1-\hat{p})/n + (z^*)^2/4n^2}}{1 + (z^*)^2/n} \tag{5.3}$$


where z* is the standard normal critical value for the specified confidence
level (e.g., z* = 1.96 for 95% confidence) and $\tilde{p}$ is the adjusted sample
proportion of successes defined by $\tilde{p} = [\hat{p} + (z^*)^2/2n]/[1 + (z^*)^2/n]$.

This is often referred to as the score confidence interval for p.
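
The “tedious algebra” behind the score interval can be sketched as follows. This is a reconstruction under the usual normal approximation rather than the authors' own working; z stands for the critical value z*, and $\hat{p}$ for the observed sample proportion. Replacing each inequality by an equality and squaring gives a quadratic equation in p, whose two roots are the interval endpoints:

```latex
\begin{align*}
(\hat{p} - p)^2 &= z^2\, p(1-p)/n \\
\left(1 + z^2/n\right)p^2 - \left(2\hat{p} + z^2/n\right)p + \hat{p}^2 &= 0 \\
p &= \frac{\left(2\hat{p} + z^2/n\right) \pm \sqrt{\left(2\hat{p} + z^2/n\right)^2 - 4\left(1 + z^2/n\right)\hat{p}^2}}{2\left(1 + z^2/n\right)} \\
  &= \frac{\hat{p} + z^2/2n \pm z\sqrt{\hat{p}(1-\hat{p})/n + z^2/4n^2}}{1 + z^2/n}
\end{align*}
```

The last step uses the fact that the discriminant simplifies to $4z^2\left[\hat{p}(1-\hat{p})/n + z^2/4n^2\right]$.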


If the sample size n is very large, then all the terms in Expression (5.3) of order
1/n are negligible compared to the others. Keeping only the dominant terms,
Eq. (5.3) is approximated by


$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \tag{5.4}$$


This approximate CI (Eq. 5.4) has the format $\hat{p} \pm z^* \cdot \hat{\sigma}_{\hat{P}}$, similar to the large-
sample CI for $\mu$ presented in Sect. 5.3, and is the one that for decades has appeared
in introductory statistics textbooks. It clearly has a much simpler and more appealing
form than Eq. (5.3), so why bother with the score interval at all?

Suppose we use z* = 1.96 in the traditional formula (5.4). Then our nominal
confidence level (the one we think we’re buying by using that z critical value) is 
approximately 95%. So before a sample is selected, the probability that the random 
interval includes the actual value of p (i.e., the coverage probability) should be 
about .95. But it turns out that the actual coverage probability for this interval can 
differ considerably from the nominal probability .95, particularly when p is not 
close to .5. This is, generally speaking, a deficiency of the traditional interval—the 
actual confidence level can be quite different from the nominal level even for 
reasonably large sample sizes. Recent research has shown that the score interval 
(Eq. 5.3) rectifies this behavior—for virtually all sample sizes and values of p, its 
actual confidence level will be quite close to the nominal level specified by the 
choice of z*. This is due largely to the fact that the interval (in particular, the
midpoint $\tilde{p}$) is shifted a bit toward .5 compared to the traditional interval. This is
especially important when p is close to 0 or 1.

In addition, the score interval can be used with nearly all sample sizes 
and parameter values. It is thus not necessary to check the conditions $np \geq 10$
and $n(1-p) \geq 10$ which would be required were the traditional interval employed.
So rather than asking when n is large enough for Eq. (5.4) to yield a good 
approximation to Eq. (5.3), our recommendation is that the score CI should always 
be used unless the sample size is extremely large (such as in simulations, where 
n = 10,000 or more is typical). The slight additional tediousness of the computation
is outweighed by the desirable properties of the interval. 
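
The coverage claim above can be checked exactly for any particular n and p, because X is binomial and only finitely many intervals are possible. The following Python sketch (Python is used here purely for illustration; the function names are hypothetical and not from the text) sums the binomial probabilities of those x values whose interval contains p, once for the traditional interval (Eq. 5.4) and once for the score interval (Eq. 5.3).

```python
from math import comb, sqrt
from statistics import NormalDist


def traditional_ci(x, n, conf=0.95):
    """Traditional interval (Eq. 5.4): p_hat +/- z* sqrt(p_hat(1 - p_hat)/n)."""
    z = NormalDist().inv_cdf((1 + conf) / 2)  # z* critical value
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def score_ci(x, n, conf=0.95):
    """Score interval (Eq. 5.3), centered at the adjusted proportion p_tilde."""
    z = NormalDist().inv_cdf((1 + conf) / 2)
    p_hat = x / n
    denom = 1 + z**2 / n
    p_tilde = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return p_tilde - half, p_tilde + half


def coverage(ci, n, p, conf=0.95):
    """Exact probability that the interval contains p when X ~ Bin(n, p)."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = ci(x, n, conf)
        if lo <= p <= hi:
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total


# Illustrative choice of n and p values (assumed, not from the text).
for p in (0.05, 0.20, 0.50):
    print(p, round(coverage(traditional_ci, 40, p), 3),
          round(coverage(score_ci, 40, p), 3))
```

With n = 40 and p = .05, for instance, the traditional interval collapses to the single point 0 whenever x = 0, which happens with probability $.95^{40} \approx .13$; its coverage works out to roughly .87, while the score interval's coverage is about .95.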


Example 5.20 A Gallup poll published June 28, 2013 reported that 41% of US 
adults surveyed felt that the most important factor in choosing which college or 
university to attend should be the percentage of graduates who are able to get a good 
job. (This was the most popular response; cost of tuition was a close second.) The 
survey was based on a random sample of n = 1012 adults; we will assume the
number who gave the above response is x = 415, so that $\hat{p} = 415/1012 = .4101$,
matching the survey. Let p denote the proportion of all US adults that feel this same
way, for which $\hat{p}$ is our point estimate. A confidence interval for p with a confidence
level of approximately 95% is


$$\frac{.4101 + 1.96^2/2(1012)}{1 + 1.96^2/1012} \pm 1.96\,\frac{\sqrt{(.4101)(.5899)/1012 + 1.96^2/(4 \cdot 1012^2)}}{1 + 1.96^2/1012}$$


$$= .4103 \pm .0302 = (.3801, .4405)$$


Hence, we are 95% confident that between 38 and 44% of all US adults feel that 
the percentage of graduates that get good jobs is the most important factor when 
choosing a college or university. The traditional interval is 


$$.4101 \pm 1.96\sqrt{\frac{(.410)(.590)}{1012}} = .4101 \pm .0303 = (.3798, .4404)$$


These two intervals are practically identical because n = 1012 is so large.


Example 5.21 The article “Repeatability and Reproducibility for Pass/Fail Data” 
(J. Testing Eval., 1997: 151-153) reported that in n = 48 trials in a particular labora- 
tory, 16 resulted in ignition of a particular type of substrate by a lighted cigarette. Let 
p denote the long-run proportion of all such trials that would result in ignition. A point 
estimate for p is $\hat{p} = 16/48 = .333$. A 95% confidence interval for p is

$$\frac{.333 + 1.96^2/96}{1 + 1.96^2/48} \pm 1.96\,\frac{\sqrt{(.333)(.667)/48 + 1.96^2/(4 \cdot 48^2)}}{1 + 1.96^2/48} = .346 \pm .129$$


$$= (.217, .475)$$


So, the researchers can be 95% confident that between 21.7 and 47.5% of all 
trials under the same conditions will result in ignition. This interval isn’t very 
precise—its width is nearly 26 percentage points—as a consequence of the rela- 
tively small sample size. If the researchers wanted a narrower interval, they would 
need to use a larger n (which, of course, requires more time and money). 

The traditional CI formula (5.4) gives 


$$.333 \pm 1.96\sqrt{(.333)(.667)/48} = .333 \pm .133 = (.200, .466)$$


These two intervals are somewhat different because n = 48 is not very large. 
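
The calculations in Examples 5.20 and 5.21 are easy to script. The Python sketch below (an illustration only; `proportion_cis` is a hypothetical helper name, not something defined in the text) evaluates Eq. (5.3) and Eq. (5.4) for given x and n. Because it carries unrounded values of $\hat{p}$ and z*, the last printed digit may differ slightly from the hand calculations above.

```python
from math import sqrt
from statistics import NormalDist


def proportion_cis(x, n, conf=0.95):
    """Return (score CI, traditional CI) for x successes in n trials."""
    z = NormalDist().inv_cdf((1 + conf) / 2)  # z* critical value
    p_hat = x / n

    # Score interval, Eq. (5.3)
    denom = 1 + z**2 / n
    p_tilde = (p_hat + z**2 / (2 * n)) / denom
    half_score = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom

    # Traditional interval, Eq. (5.4)
    half_trad = z * sqrt(p_hat * (1 - p_hat) / n)

    return ((p_tilde - half_score, p_tilde + half_score),
            (p_hat - half_trad, p_hat + half_trad))


for x, n in [(415, 1012), (16, 48)]:  # Examples 5.20 and 5.21
    score, trad = proportion_cis(x, n)
    print(f"x={x}, n={n}: score=({score[0]:.4f}, {score[1]:.4f}), "
          f"traditional=({trad[0]:.4f}, {trad[1]:.4f})")
```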


5.5.2 Hypothesis Testing for p 


Analogous to hypothesis testing for a population mean $\mu$, tests for p concern
deciding which of two competing hypotheses about the value of p is correct. The
null hypothesis will always be written in the form


$$H_0\colon p = p_0$$


where $p_0$ is the null value for the parameter p (i.e., the value claimed for p by the
null hypothesis). The alternative hypothesis has one of three forms, depending on
context:


$$H_a\colon p > p_0 \qquad H_a\colon p < p_0 \qquad H_a\colon p \neq p_0$$


Inferences about p are again based on the value of a sample proportion, $\hat{P}$. When
$H_0$ is true, $E(\hat{P}) = p_0$ and $SD(\hat{P}) = \sqrt{p_0(1-p_0)/n}$. Moreover, when n is large
and $H_0$ is true, the test statistic


$$Z = \frac{\hat{P} - p_0}{\sqrt{p_0(1-p_0)/n}}$$


has approximately a standard normal distribution. The P-value of the hypothesis
test is then determined in an analogous manner to those of the one-sample t test in
Sect. 5.4, except that calculation is made using the z table rather than a
t distribution.


ONE-PROPORTION Z TEST
Consider testing the null hypothesis $H_0\colon p = p_0$ based on a random sample of size
n. Let $\hat{P}$ denote the proportion of “successes” in the sample. The test statistic is


$$Z = \frac{\hat{P} - p_0}{\sqrt{p_0(1-p_0)/n}}$$


Provided $np_0 \geq 10$ and $n(1-p_0) \geq 10$, Z has approximately a standard normal
distribution when $H_0$ is true. Let z denote the calculated value of the test
statistic. The calculation of the P-value depends on the choice of $H_a$ as follows:


Alternative Hypothesis        P-value

$H_a\colon p > p_0$           $1 - \Phi(z)$
$H_a\colon p < p_0$           $\Phi(z)$
$H_a\colon p \neq p_0$        $2[1 - \Phi(|z|)]$


Illustrations of these P-values are essentially identical to those in Fig. 5.10. 


Example 5.22 Obesity is an increasing problem in America among all age groups. 
The Centers for Disease Control and Prevention (CDCP) reported in 2012 that 
35.7% of US adults are obese (a body mass index exceeding 30; this index is a 
measure of weight relative to height). Physicians at a large hospital in Los Angeles 
measured the body mass index of 122 randomly selected patients and found that 
38 of them should be classified as obese. Do the hospital’s data suggest that the true 
proportion of adults served by this hospital who are obese is less than the national 
figure of 35.7%? Let’s carry out a test of hypotheses using $\alpha = .05$.

The parameter of interest is p = the proportion of all adults served by this
hospital who are obese. The competing hypotheses are


$H_0\colon p = .357$ (the hospital’s obesity rate matches the national rate)
$H_a\colon p < .357$ (the hospital’s obesity rate is less than the national rate)

Since $np_0 = 122(.357) = 43.6 \geq 10$ and $n(1 - p_0) = 122(1 - .357) = 78.4 \geq 10$,
the one-proportion z test may be applied. With $\hat{p} = 38/122 = .311$, the calculated
value of the test statistic is


$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{.311 - .357}{\sqrt{.357(1 - .357)/122}} = -1.05$$


So, the observed sample proportion is about one standard deviation below what
we would expect if the null hypothesis is true. The P-value of the test is the
probability of obtaining a test statistic value at least that low:


$$\text{P-value} = P(Z \leq -1.05) = \Phi(-1.05) = .1469$$

Since the P-value of .1469 is greater than the significance level .05, we fail to
reject $H_0$. On the basis of the observed data, we cannot conclude that the obesity rate
of the population served by this hospital is less than the national rate of 35.7%.
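
The one-proportion z test is short enough to code directly. The following Python sketch (illustrative only; the function name and the `alternative` labels are assumptions of this example, not notation from the text) computes the test statistic and the P-value for each of the three alternatives, and then applies it to the data of Example 5.22. It uses the unrounded value of $\hat{p}$, so the results should agree with z = −1.05 and a P-value of about .147 up to rounding.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(x, n, p0, alternative="two-sided"):
    """Return (z, P-value) for testing H0: p = p0.

    alternative is 'greater' (Ha: p > p0), 'less' (Ha: p < p0),
    or 'two-sided' (Ha: p != p0).
    """
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    Phi = NormalDist().cdf  # standard normal cdf
    if alternative == "greater":
        p_value = 1 - Phi(z)
    elif alternative == "less":
        p_value = Phi(z)
    elif alternative == "two-sided":
        p_value = 2 * (1 - Phi(abs(z)))
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, p_value


# Example 5.22: 38 obese patients out of 122, H0: p = .357 vs Ha: p < .357
z, p_value = one_proportion_z_test(38, 122, 0.357, "less")
print(round(z, 2), round(p_value, 4))
```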


As was the case for inferences on $\mu$, it is desirable to calculate the power of a
hypothesis test concerning a population proportion p. The power of our one-sample 
z test depends on how far the true value of p is from the null value po, the sample 
size, and the selected significance level. The details of such power calculations, 
which many software packages can perform automatically, are developed in 
Exercises 96 and 97 of this section. 
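
As a rough illustration of how power depends on these three quantities, the Python sketch below evaluates the normal-approximation power expression stated in Exercise 96(b) for a lower-tailed test. The function name and the particular values of n and the true p are choices made for this example, not taken from the text.

```python
from math import sqrt
from statistics import NormalDist


def power_lower_tailed(p0, p_true, n, alpha):
    """Approximate power of the lower-tailed one-proportion z test
    (H0: p = p0 vs Ha: p < p0) when the true proportion is p_true."""
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha)              # upper-tail critical value
    sd_null = sqrt(p0 * (1 - p0) / n)             # SD of P-hat under H0
    sd_true = sqrt(p_true * (1 - p_true) / n)     # SD of P-hat at p_true
    return std.cdf((p0 - p_true - z_alpha * sd_null) / sd_true)


# Setting of Example 5.22 (p0 = .357, alpha = .05) with an assumed true p of .30:
for n in (122, 500, 2000):
    print(n, round(power_lower_tailed(0.357, 0.30, n, 0.05), 3))
```

Running the loop shows the power rising with n, which is the usual reason for performing such a calculation before data are collected.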

Inferences about p when n is small can be based directly on the binomial 
distribution. There are also procedures available for making inferences about a 
difference $p_1 - p_2$ between two population proportions (e.g., the proportion of all
female versus male students that make the honor roll at your school). Please consult 
the reference by Devore and Berk for more information. 


5.5.3 Software for Inferences about p 


The prop.test function in R will calculate a score-type CI for a population
proportion (Eq. 5.3, by default with a continuity correction) and perform a
one-proportion z test upon request. Figure 5.13 shows output corresponding to
Examples 5.21 and 5.22. The 95% CI in Fig. 5.13a for p is roughly (.208, .485);
the difference between this interval and the score interval (.217, .475) obtained
in Example 5.21 is due to an adjustment made automatically in R called Yates’
continuity correction. The inputs to prop.test in Fig. 5.13b include not only the
raw data x and n, but also the null value of p and the direction of the test. The
resulting P-value, .1698, is close to the value of .1469 obtained in Example 5.22.
Again, the disparity comes from a combination of rounding and the continuity
correction. Unfortunately, to the authors’ knowledge, there are no one-proportion
z intervals or z tests built into Matlab.


(a)
> prop.test(16,48)

        1-sample proportions test with continuity correction

data:  16 out of 48, null probability 0.5
X-squared = 4.6875, df = 1, p-value = 0.03038
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.2080794 0.4851357
sample estimates:
        p
0.3333333


(b)
> prop.test(38,122,p=.357,"less")

        1-sample proportions test with continuity correction

data:  38 out of 122, null probability 0.357
X-squared = 0.9121, df = 1, p-value = 0.1698
alternative hypothesis: true p is less than 0.357
95 percent confidence interval:
 0.0000000 0.3881457
sample estimates:
        p
0.3114754


Fig. 5.13 Inferences on p in R: (a) Example 5.21; (b) Example 5.22
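
To see where the continuity correction enters, the following Python sketch computes a continuity-corrected version of the test statistic: the distance |x − np₀| is reduced by 0.5 before squaring. This is an assumption about how the correction is applied, offered for illustration rather than as a description of R's internals, and `corrected_z_test` is a hypothetical name. With the data of Fig. 5.13 it gives values that agree with the printed X-squared statistics and P-values to the digits shown.

```python
from math import copysign, sqrt
from statistics import NormalDist


def corrected_z_test(x, n, p0, alternative="two-sided"):
    """One-proportion test with a Yates-style continuity correction.

    Returns (X-squared, P-value); X-squared is the square of the
    corrected z statistic.
    """
    # Assumed form of the correction: shrink |x - n*p0| by at most 0.5.
    yates = min(0.5, abs(x - n * p0))
    x_squared = (abs(x - n * p0) - yates) ** 2 / (n * p0 * (1 - p0))
    # Signed z statistic: negative when the sample proportion is below p0.
    z = copysign(sqrt(x_squared), x / n - p0)
    Phi = NormalDist().cdf
    if alternative == "less":
        p_value = Phi(z)
    elif alternative == "greater":
        p_value = 1 - Phi(z)
    else:
        p_value = 2 * (1 - Phi(abs(z)))
    return x_squared, p_value


print(corrected_z_test(16, 48, 0.5))             # data of Fig. 5.13a
print(corrected_z_test(38, 122, 0.357, "less"))  # data of Fig. 5.13b
```

Setting the correction to zero in this sketch recovers the uncorrected one-proportion z test described in this section.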


5.5.4 Exercises: Section 5.5 (77-97) 


77. In a sample of 1000 randomly selected consumers who had opportunities to
send in a rebate claim form after purchasing a product, 250 of these people
said they never did so (“Rebates: Get What You Deserve,” Consumer Reports,
May 2009: 7). Reasons cited for their behavior included too many steps in the
process, amount too small, missed deadline, fear of being placed on a mailing
list, lost receipt, and doubts about receiving the money. Calculate and interpret
a 95% confidence interval for the true proportion of such consumers who
never apply for a rebate. 

78. A Wireless News article (July 6, 2008) found that 62% of people surveyed 
would use a Bluetooth device while driving in order to comply with the law. 
The survey was based upon a random sample of 600 cell phone users. 
Construct a 95% confidence interval for the proportion of all cell phone 
users who will use Bluetooth technology while driving. 

79. The article “Limited Yield Estimation for Visual Defect Sources” (IEEE 
Trans. Semicon. Manuf., 1997: 17-23) reported that, in a study of a particular 
wafer inspection process, 356 dies were examined by an inspection probe and 
201 of these passed the probe. Assuming a stable process, calculate a 99% 
confidence interval for the proportion of all dies that pass the probe. 

80. The technology underlying hip replacements has changed as these operations 
have become more popular (over 250,000 in the USA in 2008). Starting in 
2003, highly durable ceramic hips were marketed. Unfortunately, for too 
many patients the increased durability has been counterbalanced by an 
increased incidence of squeaking. The May 11, 2008, issue of the New York 
Times reported that in one study of 143 individuals who received ceramic hips 
between 2003 and 2005, 10 of the hips developed squeaking. Calculate and
interpret a 95% confidence interval for the true proportion of such hips that
develop squeaking. 

81. The Pew Forum on Religion and Public Life reported on December 9, 2009, 
that in a survey of 2003 American adults, 25% said they believed in astrology. 
Calculate and interpret a confidence interval at the 99% confidence level for 
the proportion of all adult Americans who believe in astrology. 

82. Reconsider the score CI (Eq. 5.3) for p, and focus on a confidence level of
95%. Show that the endpoints agree quite well with those of the traditional
interval (Eq. 5.4) once two successes and two failures have been appended to
the sample, i.e., Eq. (5.4) based on (x + 2) S’s in (n + 4) trials. [Hint: $1.96 \approx 2$.]

83. It is often important in planning studies to know in advance what sample size is
required to estimate an unknown proportion to within a certain margin of error.
(a) Suppose we wish to achieve a bound B on the margin of error of a CI for p.
By equating the margin of error in the “very large n” CI Eq. (5.4) to B and
solving for n, show that the required sample size is


$$n = \frac{(z^*)^2\, \hat{p}(1-\hat{p})}{B^2}$$


(b) A state legislator wishes to survey residents of her district to see what 
proportion of the electorate is aware of her position on using state funds to 
pay for abortions. If the legislator has strong reason to believe that at least 
2/3 of the electorate know of her position, how large a sample size would 
you recommend in order to estimate the true proportion to within ±5
percentage points? Assume 95% confidence.

(c) What sample size is necessary if the 95% CI for p is to have width of at 
most .10 irrespective of p? [Hint: What value of p makes the expression 
for n as large as possible? ] 

84. Write a function in Matlab or R to implement (Eq. 5.3). Your function should
have three inputs: the number of successes x, the sample size n, and the
desired confidence level. The output of the function should be the endpoints
of the CI.

85. Natural cork in wine bottles is subject to deterioration, and as a result wine in
such bottles may experience contamination. The article “Effects of Bottle
Closure Type on Consumer Perceptions of Wine Quality” (Amer. J. of Enology
and Viticulture, 2007: 182-191) reported that, in a tasting of commercial
chardonnays, 16 of 91 bottles were considered spoiled to some extent by cork-
associated characteristics. Does this data provide strong evidence for concluding
that more than 15% of all such bottles are contaminated in this way? Carry
out a test of hypotheses using a significance level of .10.

86. It is known that roughly 2/3 of all human beings have a dominant right foot or
eye. Is there also right-sided dominance in kissing behavior? The article
“Human Behavior: Adult Persistence of Head-Turning Asymmetry” (Nature,
2003: 771) reported that in a random sample of 124 kissing couples, both
people in 80 of the couples tended to lean more to the right than to the left.
Does the result of the experiment suggest that the 2/3 figure is implausible for
kissing behavior? State and test the appropriate hypotheses.

87. The article referenced in Exercise 85 also reported that in a sample of
106 wine consumers, 22 (20.8%) thought that screw tops were an acceptable
substitute for natural corks. Suppose a particular winery decided to use screw
tops for one of its wines unless there was strong evidence to suggest that fewer
than 25% of wine consumers found this acceptable.
(a) Using a significance level of .10, what would you recommend to the
winery?
(b) For the hypotheses tested in (a), describe in context what the Type I and II
errors would be, and say which type of error might have been committed.

88. With domestic sources of building supplies running low several years ago,
roughly 60,000 homes were built with imported Chinese drywall. According 
to the article “Report Links Chinese Drywall to Home Problems” (New York 
Times, November 24, 2009), federal investigators identified a strong associa- 
tion between chemicals in the drywall and electrical problems, and there is 
also strong evidence of respiratory difficulties due to the emission of hydrogen 
sulfide gas. An extensive examination of 51 homes found that 41 had such 
problems. Suppose these 51 were randomly sampled from the population of 
all homes having Chinese drywall. Does the data provide strong evidence for 
concluding that more than 50% of all homes with Chinese drywall have 
electrical/environmental problems? Carry out a test of hypotheses using
$\alpha = .01$.

89. A common characterization of obese individuals is that their body mass index
is at least 30 [$\text{BMI} = \text{weight}/(\text{height})^2$ when height is in meters and weight is
in kilograms]. The article “The Impact of Obesity on Illness Absence and 
Productivity in an Industrial Population of Petrochemical Workers” (Annals 
of Epidemiology, 2008: 8-14) reported that in a sample of female workers, 
262 had BMIs of less than 25, 159 had BMIs that were at least 25 but less than 
30, and 120 had BMIs exceeding 30. Is there compelling evidence for 
concluding that more than 20% of the individuals in the sampled population 
are obese? 
(a) State and test appropriate hypotheses using a significance level of .05. 
(b) Explain in the context of this scenario what constitutes Type I and II errors. 

90. The article “Analysis of Reserve and Regular Bottlings: Why Pay for a
Difference Only the Critics Claim to Notice?” (Chance, Summer 2005, 
pp. 9-15) reported on an experiment to investigate whether wine tasters 
could distinguish between more expensive reserve wines and their regular 
counterparts. Wine was presented to tasters in four containers labeled A, B, C, 
and D, with two of these containing the reserve wine and the other two the 
regular wine. Each taster randomly selected three of the containers, tasted the 
selected wines, and indicated which of the three he/she believed was different 
from the other two. Of the n = 855 tasting trials, 346 resulted in correct
distinctions (either the one reserve that differed from the two regular wines 
or the one regular wine that differed from the two reserves). Does this provide 
compelling evidence for concluding that tasters of this type have some ability 
to distinguish between reserve and regular wines? State and test the relevant 
hypotheses. Are you particularly impressed with the ability of tasters to 
distinguish between the two types of wine? 

91. The article “Heavy Drinking and Polydrug Use Among College Students”
(J. of Drug Issues, 2008: 445-466) stated that 51 of the 462 college students in 
a sample had a lifetime abstinence from alcohol. Does this provide strong 
evidence for concluding that more than 10% of the population sampled had 
completely abstained from alcohol use? Test the appropriate hypotheses. 
[Note: The article used more advanced statistical methods to study the use 
of various drugs among students characterized as light, moderate, and heavy 
drinkers.]

92. Scientists have recently become concerned about the safety of Teflon cook-
ware and various food containers because perfluorooctanoic acid (PFOA) is 
used in the manufacturing process. An article in the July 27, 2005, New York 
Times reported that of 600 children tested, 96% had PFOA in their blood. 
According to the FDA, 90% of all Americans have PFOA in their blood. Does 
the data on PFOA incidence among children suggest that the percentage of all 
children who have PFOA in their blood exceeds the FDA percentage for all 
Americans? Carry out an appropriate test of hypotheses at the $\alpha = .05$ level.

93. A manufacturer of nickel-hydrogen batteries randomly selects 100 nickel
plates for test cells, cycles them a specified number of times, and determines 
that 14 of the plates have blistered. Does this provide compelling evidence for 
concluding that more than 10% of all plates blister under such circumstances? 
State and test the appropriate hypotheses using a significance level of .05. In 
reaching your conclusion, what type of error might you have committed? 

94. A random sample of 150 recent donations at a blood bank reveals that 82 were
type A blood. Does this suggest that the actual percentage of type A donations 
differs from 40%, the percentage of the population having type A blood? 
Carry out a test of the appropriate hypotheses using a significance level of .01. 
Would your conclusion have been different if a significance level of .05 had 
been used? 

95. The article “Statistical Evidence of Discrimination” (J. Amer. Statist. Assoc.,
1982: 773-783) discusses the court case Swain v. Alabama (1965), in which it 
was alleged that there was discrimination against blacks in grand jury selec- 
tion. Census data suggested that 25% of those eligible for grand jury service 
were black, yet a random sample of 1050 people called to appear for possible 
duty yielded only 177 blacks. Using a level .01 test, does this data argue 
strongly for a conclusion of discrimination? 

96. Consider testing hypotheses $H_0\colon p = p_0$ versus $H_a\colon p < p_0$. Suppose that, in
fact, the true value of the parameter p is $p'$, where $p' < p_0$ (so $H_a$ is true).
(a) Show that the expected value and variance of the test statistic Z in the
one-proportion z test are


$$E(Z) = \frac{p' - p_0}{\sqrt{p_0(1-p_0)/n}} \qquad \mathrm{Var}(Z) = \frac{p'(1-p')/n}{p_0(1-p_0)/n}$$


(b) It can be shown that P-value $\leq \alpha$ iff $Z \leq -z_\alpha$, where $-z_\alpha$ denotes the $\alpha$
quantile of the standard normal distribution (i.e., $\Phi(-z_\alpha) = \alpha$). Show that
the power of the lower-tailed one-sample z test when $p = p'$ is given by


$$\Phi\left(\frac{p_0 - p' - z_\alpha\sqrt{p_0(1-p_0)/n}}{\sqrt{p'(1-p')/n}}\right)$$

(c) A package-delivery service advertises that at least 90% of all packages 
brought to its office by 9 a.m. for delivery in the same city are delivered by 
noon that day. Let p denote the true proportion of such packages that are 
delivered as advertised and consider the null hypotheses Ho: p = .9 versus 
the alternative H,: p< .9. If only 80% of the packages are delivered as 
advertised, how likely is it that a level .01 test based on n = 225 packages 
will detect such a departure from Ho? 


97. Because of variability in the manufacturing process, the actual yielding point
of a sample of mild steel subjected to increasing stress will usually differ from
the theoretical yielding point. Let $p$ denote the true proportion of samples that
yield before their theoretical yielding point. If on the basis of a sample it can
be concluded that more than 20% of all specimens yield before the theoretical
point, the production process will have to be modified.

(a) If 15 of 60 specimens yield before the theoretical point, what is the
P-value when the appropriate test is used, and what would you advise the
company to do?

(b) If the true percentage of “early yields” is actually 50% (so that the
theoretical point is the median of the yield distribution) and a level .01
test is used, what is the probability that the company concludes a
modification of the process is necessary? [Hint: Refer back to the previous
exercise. Modify the expression in part (b) to accommodate an
upper-tailed test.]


5.6 Bayesian Inference


Throughout this chapter, we have regarded parameters such as $\mu$, $\sigma$, $p$, and $\lambda$ as
having an unknown but single, fixed value. This is often referred to as the classical
or frequentist approach to statistical inference. However, there is a different
paradigm, called subjective or Bayesian inference, in which an unknown parameter is
assigned a distribution of possible values, analogous to a probability distribution.
This distribution reflects all available information (past experience, intuition,
common sense) about the parameter prior to observing the data. For this reason,
it is called the prior distribution of the parameter.
DEFINITION

A prior distribution for a parameter $\theta$, denoted $\pi(\theta)$, is a probability
distribution on the set of possible values for $\theta$. In particular, if the possible values
of the parameter $\theta$ form an interval $I$, then $\pi(\theta)$ is a pdf that must satisfy

\[
\int_I \pi(\theta)\,d\theta = 1
\]

Similarly, if $\theta$ is potentially any value in a discrete set $D$, then $\pi(\theta)$ is a pmf
that must satisfy

\[
\sum_{\theta \in D} \pi(\theta) = 1
\]


Example 5.23 Consider the parameter $\mu$ = the mean GPA of all students at your
university. Since GPAs are always between 0.0 and 4.0, $\mu$ must also lie in this
interval. But common sense tells you that $\mu$ is unlikely to be below 2.0, or very few
people would graduate, and it would be likewise surprising to find $\mu$ much above
3.5. This “prior belief” can be expressed mathematically as a prior distribution for $\mu$
on the interval $I = [0, 4]$. If our best guess a priori is that $\mu \approx 2.5$, then our prior
distribution $\pi(\mu)$ should be centered around 2.5. The variability of the prior
distribution we select should reflect how sure we feel about our initial information.

If we feel very sure that $\mu$ is near 2.5, then we should select a prior distribution
for $\mu$ that has less variation around that value. On the other hand, if we are less
certain, this can be reflected by a prior distribution with much greater variability.
Figure 5.14 illustrates these two cases.

Fig. 5.14 Two prior distributions for a parameter: a more diffuse prior (less certainty) and a more
concentrated prior (more certainty) ■
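
The two situations in Example 5.23 can be made concrete with a small sketch. The choice of family here is an assumption made only for illustration: each prior is a beta distribution rescaled to the interval [0, 4], and the parameter pairs (5, 3) and (50, 30) are hypothetical values picked so that both priors have mean 2.5 but different spreads.

```python
import math

def scaled_beta_pdf(mu, a, b, upper=4.0):
    """Density at mu of upper * Beta(a, b), a prior supported on (0, upper)."""
    if not 0.0 < mu < upper:
        return 0.0
    t = mu / upper
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    log_pdf = log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t)
    return math.exp(log_pdf) / upper  # divide by upper for the change of scale

def scaled_beta_mean_sd(a, b, upper=4.0):
    """Mean and standard deviation of upper * Beta(a, b)."""
    mean = upper * a / (a + b)
    var = upper ** 2 * a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

# Hypothetical parameter choices: same center, different certainty.
diffuse = (5, 3)
concentrated = (50, 30)

diffuse_mean, diffuse_sd = scaled_beta_mean_sd(*diffuse)
conc_mean, conc_sd = scaled_beta_mean_sd(*concentrated)

print(f"diffuse prior:      mean={diffuse_mean:.2f}, sd={diffuse_sd:.3f}")
print(f"concentrated prior: mean={conc_mean:.2f}, sd={conc_sd:.3f}")
```

Both priors are centered at 2.5; the second simply places far more of its probability close to that value, which is the sense in which it expresses a stronger prior belief.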
5.6.1 The Posterior Distribution of a Parameter


The key to Bayesian inference is having a mathematically rigorous way to
incorporate the actual sample data. Suppose we observe values $x_1, \dots, x_n$ from a
distribution depending on the unknown parameter $\theta$ for which we have selected some prior
distribution. Then a Bayesian statistician wants to “update” her or his belief about
the distribution of $\theta$, taking into account both prior belief and the observed $x_i$s. This
is achieved using a form of Bayes’ Theorem for random variables.


DEFINITION
Suppose $X_1, \dots, X_n$ have joint pdf $f(x_1, \dots, x_n; \theta)$ and the unknown parameter
$\theta$ has been assigned a continuous prior distribution $\pi(\theta)$. Then the posterior
distribution of $\theta$, given the observations $X_1 = x_1, \dots, X_n = x_n$, is

\[
\pi(\theta \mid x_1, \dots, x_n) =
\frac{\pi(\theta)\,f(x_1, \dots, x_n; \theta)}
     {\int_{-\infty}^{\infty} \pi(\theta)\,f(x_1, \dots, x_n; \theta)\,d\theta}
\tag{5.5}
\]

The integral in the denominator of Eq. (5.5) ensures that the posterior
distribution is a valid probability density for $\theta$.

If $X_1, \dots, X_n$ are discrete, the joint pdf is replaced by their joint pmf.
If the prior distribution of $\theta$ is discrete, the integral is replaced by a sum.


Notice that constructing the posterior distribution of a parameter requires a
specific probability model $f(x_1, \dots, x_n; \theta)$ for the observed data. In Example 5.23,
it would not be enough to simply observe the GPAs of a random sample of
$n$ students; one must specify the underlying distribution, with mean $\mu$, from
which those GPAs are drawn.


Example 5.24 Emissions of subatomic particles from a radiation source are often
modeled as a Poisson process. As we shall see in Chap. 7, this implies that the time
between successive emissions follows an exponential distribution. In practice, the
parameter $\lambda$ of this distribution is typically unknown. If researchers believe a priori
that the average time between emissions is about half a second, so $\lambda \approx 2$, a prior
distribution with a mean around 2 might be selected for $\lambda$. One example is the
following gamma distribution, which has mean (and variance) of 2:

\[
\pi(\lambda) = \lambda e^{-\lambda}, \qquad \lambda > 0
\]

Notice the gamma distribution lies on the interval $(0, \infty)$, which is also the set of
possible values for the unknown parameter $\lambda$.

The times $X_1, \dots, X_5$ between five particle emissions will be recorded; it is these
variables that have an exponential distribution with the unknown parameter $\lambda$
(equivalently, mean $1/\lambda$). Because the $X_i$s are also independent, their joint pdf is

\[
f(x_1, \dots, x_5; \lambda) = f(x_1; \lambda) \cdots f(x_5; \lambda)
= \lambda e^{-\lambda x_1} \cdots \lambda e^{-\lambda x_5}
= \lambda^5 e^{-\lambda \sum x_i}
\]

Applying Eq. (5.5) with these two components, the posterior distribution of $\lambda$
given the observed data is

\[
\pi(\lambda \mid x_1, \dots, x_5)
= \frac{\pi(\lambda)\,f(x_1, \dots, x_5; \lambda)}
       {\int_{-\infty}^{\infty} \pi(\lambda)\,f(x_1, \dots, x_5; \lambda)\,d\lambda}
= \frac{\lambda e^{-\lambda} \cdot \lambda^5 e^{-\lambda \sum x_i}}
       {\int_0^{\infty} \lambda e^{-\lambda} \cdot \lambda^5 e^{-\lambda \sum x_i}\,d\lambda}
= \frac{\lambda^6 e^{-\lambda(1 + \sum x_i)}}
       {\int_0^{\infty} \lambda^6 e^{-\lambda(1 + \sum x_i)}\,d\lambda}
\]

Suppose the five observed inter-emission times are $x_1 = 0.66$, $x_2 = 0.48$,
$x_3 = 0.44$, $x_4 = 0.71$, $x_5 = 0.56$. The sum of these five times is $\sum x_i = 2.85$, and so
the posterior distribution simplifies to

\[
\pi(\lambda \mid 0.66, \dots, 0.56)
= \frac{\lambda^6 e^{-3.85\lambda}}{\int_0^{\infty} \lambda^6 e^{-3.85\lambda}\,d\lambda}
= \frac{3.85^7}{6!}\,\lambda^6 e^{-3.85\lambda}
\]

The integral in the denominator was evaluated using the gamma integral formula
(3.5) from Chap. 3; as noted previously, the purpose of this integral is to guarantee
that the posterior distribution of $\lambda$ is a valid probability density. As a function of $\lambda$,
we recognize this as a gamma distribution with parameters $\alpha = 7$ and $\beta = 1/3.85$.
The prior and posterior density curves of $\lambda$ appear in Fig. 5.15.

Fig. 5.15 Prior and posterior distribution of $\lambda$ for Example 5.24 ■
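
As a numerical check on Example 5.24, Eq. (5.5) can be evaluated directly, without recognizing the gamma integral. The sketch below approximates the denominator with a midpoint Riemann sum and compares the result with the closed form $(3.85^7/6!)\,\lambda^6 e^{-3.85\lambda}$. The step size and the truncation of the integral at $\lambda = 40$ are assumptions chosen for illustration, not part of the example.

```python
import math

# Observed inter-emission times from Example 5.24
times = [0.66, 0.48, 0.44, 0.71, 0.56]
n, total = len(times), sum(times)

def prior(lam):
    """Gamma(2, 1) prior density: lam * exp(-lam) for lam > 0."""
    return lam * math.exp(-lam)

def likelihood(lam):
    """Joint exponential pdf: lam^n * exp(-lam * sum of x_i)."""
    return lam ** n * math.exp(-lam * total)

# Midpoint Riemann sum for the denominator of Eq. (5.5) over (0, 40);
# the integrand is negligible beyond that point.
step = 0.001
grid = [(i + 0.5) * step for i in range(40_000)]
denominator = sum(prior(lam) * likelihood(lam) for lam in grid) * step

def posterior_numeric(lam):
    """Posterior density from Eq. (5.5) with a numerical denominator."""
    return prior(lam) * likelihood(lam) / denominator

def posterior_closed_form(lam):
    """Gamma(7, 1/3.85) density derived in the example."""
    rate = 1 + total
    return rate ** 7 / math.factorial(6) * lam ** 6 * math.exp(-rate * lam)

for lam in (1.0, 1.8, 3.0):
    print(lam, posterior_numeric(lam), posterior_closed_form(lam))
```

The same brute-force approach works when the prior is not conjugate and no closed form is available, which is the main reason to keep it in mind.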
Example 5.25 A 2010 National Science Foundation study found that 488 out of
939 surveyed adults incorrectly believe that antibiotics kill viruses (they only kill
bacteria). Let $\theta$ denote the proportion of all US adults that hold this mistaken view.
Imagine that an NSF researcher, in advance of administering the survey, believed
(hoped?) the value of $\theta$ was roughly 1 in 3, but he was very uncertain about this
belief. Since any proportion must lie between 0 and 1, the beta family of
distributions from Sect. 3.5 provides a natural source of priors for $\theta$. One such
beta distribution, with an expected value of 1/3, is the Beta(2, 4) model whose pdf is

\[
\pi(\theta) = 20\theta(1 - \theta)^3, \qquad 0 < \theta < 1
\]

The data mentioned at the beginning of the example can be considered either a
random sample of size 939 from the Bernoulli distribution or, equivalently, a single
observation from the binomial distribution with $n = 939$. Let $X$ = the number of
US adults in a random sample of 939 that believe antibiotics kill viruses. Then
$X \sim \mathrm{Bin}(939, \theta)$, and the pmf of $X$ is $p(x; \theta) = \binom{939}{x}\theta^x(1 - \theta)^{939 - x}$. Substituting the
observed value $x = 488$, Eq. (5.5) gives the posterior distribution of $\theta$ as

\[
\pi(\theta \mid X = 488)
= \frac{\pi(\theta)\,p(488; \theta)}{\int_0^1 \pi(\theta)\,p(488; \theta)\,d\theta}
= \frac{20\theta(1 - \theta)^3 \cdot \binom{939}{488}\theta^{488}(1 - \theta)^{451}}
       {\int_0^1 20\theta(1 - \theta)^3 \cdot \binom{939}{488}\theta^{488}(1 - \theta)^{451}\,d\theta}
\]

\[
= \frac{\theta^{489}(1 - \theta)^{454}}{\int_0^1 \theta^{489}(1 - \theta)^{454}\,d\theta}
= c \cdot \theta^{489}(1 - \theta)^{454}, \qquad 0 < \theta < 1
\]

Recall that the constant $c$, which equals the reciprocal of the integral in the
denominator, serves to ensure that the posterior distribution $\pi(\theta \mid X = 488)$ integrates
to 1. Rather than evaluate the integral, we can simply recognize the expression
$\theta^{489}(1 - \theta)^{454}$ as a standard beta distribution, specifically with parameters $\alpha = 490$
and $\beta = 455$, that’s just missing the normalizing constant in front. It follows that
the posterior distribution of $\theta$ given $X = 488$ must be Beta(490, 455); if we require $c$,
it can be copied directly from the beta pdf. (This trick comes in handy quite often in
Bayesian statistics: if we can recognize a posterior distribution as being the
“kernel” of a particular probability distribution, then it must necessarily be that
distribution.)

The prior and posterior density curves for $\theta$ are displayed in Fig. 5.16. While the
prior distribution is centered around 1/3 and exhibits a great deal of uncertainty
(variability), the posterior distribution of $\theta$ is centered much closer to the sample
proportion of incorrect answers, 488/939 = .52, with considerably less uncertainty.

Fig. 5.16 Density curves for the parameter $\theta$ in Example 5.25: (a) prior Beta(2, 4); (b) posterior
Beta(490, 455) ■
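
The “kernel” argument in Example 5.25 can be checked numerically. If $\theta^{489}(1 - \theta)^{454}$ really is the Beta(490, 455) pdf up to a constant, then the ratio of the two must be the same at every $\theta$. The sketch below works on the log scale, because the raw values underflow ordinary floating point; the function names and the particular test points are illustrative choices.

```python
import math

def log_beta_pdf(theta, a, b):
    """Log of the Beta(a, b) density at theta, for 0 < theta < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

def log_kernel(theta, x=488, n=939):
    """Log of prior times likelihood with constants dropped:
    theta^(1 + x) * (1 - theta)^(3 + n - x) for the Beta(2, 4) prior."""
    return (1 + x) * math.log(theta) + (3 + n - x) * math.log(1 - theta)

# The log ratio should be one constant, log c, regardless of theta.
thetas = [0.40, 0.50, 0.52, 0.60]
log_c_values = [log_beta_pdf(t, 490, 455) - log_kernel(t) for t in thetas]

print(log_c_values)
```

Every entry of `log_c_values` agrees to within rounding error, and the common value is the logarithm of the constant $c$ that the example says can be copied from the beta pdf.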


5.6.2 Inferences from the Posterior Distribution


Inferences about an unknown parameter can be obtained from its posterior
distribution. The most common Bayesian point estimate for a parameter $\theta$ is the mean of
its posterior distribution:

\[
\hat{\theta} = E(\theta \mid x_1, \dots, x_n)
\]

An interval $[a, b]$ having posterior probability .95 gives a 95% credibility
interval, the Bayesian analogue of a 95% confidence interval (but the interpretation
is different). Typically one selects the middle 95% of the posterior distribution, i.e.,
the endpoints of a 95% credibility interval are ordinarily the .025 and .975 quantiles
of the posterior distribution.


Example 5.26 (Example 5.24 continued) Given the observed values of $X_1, \dots, X_5$,
we previously found that the mean emission rate $\lambda$ has a Gamma(7, 1/3.85)
posterior distribution. Thus, the mean of the posterior distribution of $\lambda$ is

\[
\hat{\lambda} = E(\lambda \mid 0.66, \dots, 0.56) = \alpha\beta = 7(1/3.85) = 1.82
\]

This isn’t too different from the researchers’ prior belief that $\lambda \approx 2$. A 95%
credibility interval for $\lambda$ requires determining the .025 and .975 quantiles of the
Gamma(7, 1/3.85) model; using statistical software, $\eta_{.025} = 0.7310$ and
$\eta_{.975} = 3.3921$. Under the Bayesian interpretation, having observed the five
aforementioned inter-emission times, there is a 95% posterior probability that $\lambda$ is
between 0.7310 and 3.3921 emissions per second. ■
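
The quantiles quoted in Example 5.26 come from statistical software. For readers without such software, the sketch below reproduces them using only the Python standard library. It relies on the fact that a gamma distribution with integer shape has a closed-form cdf (a finite sum), and it inverts that cdf by bisection; the function names and the bisection bracket are illustrative assumptions.

```python
import math

def gamma_cdf_int_shape(x, shape, rate):
    """CDF of a gamma distribution with integer shape and the given rate
    (rate = 1/beta): 1 - exp(-rate*x) * sum_{k<shape} (rate*x)^k / k!."""
    if x <= 0:
        return 0.0
    term, partial_sum = 1.0, 1.0  # k = 0 term
    for k in range(1, shape):
        term *= rate * x / k
        partial_sum += term
    return 1.0 - math.exp(-rate * x) * partial_sum

def gamma_quantile(p, shape, rate, lo=0.0, hi=100.0):
    """Invert the cdf by bisection on the bracket [lo, hi]."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if gamma_cdf_int_shape(mid, shape, rate) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Posterior from Example 5.24: alpha = 7, beta = 1/3.85, i.e. rate = 3.85
shape, rate = 7, 3.85
posterior_mean = shape / rate
lower = gamma_quantile(0.025, shape, rate)
upper = gamma_quantile(0.975, shape, rate)

print(f"posterior mean = {posterior_mean:.2f}")
print(f"95% credibility interval = ({lower:.4f}, {upper:.4f})")
```

Bisection is slow compared with library routines but is easy to verify by hand, and it only requires that the cdf be increasing.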
Example 5.27 (Example 5.25 continued) The posterior distribution of the
parameter $\theta$ = the proportion of all US adults that incorrectly believe antibiotics kill
viruses was found to have a Beta(490, 455) distribution. A point estimate of $\theta$ is
the mean of this distribution:

\[
\hat{\theta} = E(\theta \mid X = 488) = \frac{\alpha}{\alpha + \beta}
= \frac{490}{490 + 455} = \frac{490}{945} = .5185
\]

Notice this is quite close to the traditional estimate $x/n = 488/939 = .5197$; in
general, when $n$ is large the mean of the posterior distribution of a parameter will be
quite similar to its more traditional, frequentist estimate.

The .025 and .975 quantiles of this beta distribution are $\eta_{.025} = .4866$ and
$\eta_{.975} = .5503$. So, after observing the results of the NSF survey, there is a 95%
posterior probability that $\theta$ is between .4866 and .5503. ■
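
The beta quantiles in Example 5.27 can likewise be approximated without specialized software. The sketch below accumulates the Beta(490, 455) density on a fine grid until the running area reaches the desired probability. The grid size is an assumption that limits the accuracy to roughly the grid spacing, so this is a rough check rather than a replacement for a proper quantile routine.

```python
import math

a, b = 490, 455  # posterior parameters from Example 5.25
log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

def beta_pdf(theta):
    """Beta(a, b) density, computed on the log scale to avoid overflow."""
    log_pdf = log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    return math.exp(log_pdf)

def beta_quantile(p, steps=100_000):
    """Smallest grid point whose accumulated area (midpoint rule) reaches p."""
    step = 1.0 / steps
    cumulative = 0.0
    for i in range(steps):
        theta = (i + 0.5) * step
        cumulative += beta_pdf(theta) * step
        if cumulative >= p:
            return (i + 1) * step
    return 1.0

posterior_mean = a / (a + b)
lower = beta_quantile(0.025)
upper = beta_quantile(0.975)

print(f"posterior mean = {posterior_mean:.4f}")
print(f"95% credibility interval = ({lower:.4f}, {upper:.4f})")
```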


5.6.3 Further Comments on Bayesian Inference 


In most cases, the role of the observed values in shaping the posterior distribution of
a parameter $\theta$ increases as the sample size $n$ increases. More precisely, it can be
shown that under very general conditions, as $n \to \infty$ the mean of the posterior
distribution will converge to the true value of $\theta$ while the variance of the posterior
distribution of $\theta$ converges to zero:

\[
E(\theta \mid X_1, \dots, X_n) \to \theta \qquad \mathrm{Var}(\theta \mid X_1, \dots, X_n) \to 0
\]

The second property manifests itself in our two previous examples: the
variability of the posterior distribution of $\lambda$ based on $n = 5$ observations was still
rather substantial, while the posterior distribution of $\theta$ based on a sample of size
$n = 939$ was quite concentrated.

Since traditional estimators such as $\hat{P}$ and $\bar{X}$ converge to the true values of
corresponding parameters (e.g., $p$ or $\mu$) by the Law of Large Numbers, it follows
that Bayesian and frequentist estimates will typically be quite close when $n$ is large.
This is true both for the point estimates and the interval estimates. But when $n$ is
small (a common occurrence in Bayesian methodology), parameter estimates can
differ drastically between the two methods. This is especially true if the
researcher’s prior belief is very far from what’s actually true (e.g., believing a
proportion is around 1/3 when it’s really more than .5).

It should be emphasized that, even if the confidence interval is nearly the same as
the credibility interval for a parameter, they have different interpretations. To
interpret the Bayesian credibility interval, we say that there is a 95% probability
that the parameter $\theta$ is in the interval. However, for the frequentist confidence
interval such a probability statement does not make sense: as we discussed in
Sect. 5.3, neither the parameter $\theta$ nor the endpoints of the interval are considered
random under the traditional, frequentist view.
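
The shrinking influence of the prior can be illustrated with the beta-binomial setting of Example 5.25. In the sketch below, the Beta(2, 4) prior is held fixed while the sample size grows. The sample sizes and the fixed sample proportion of .52 are hypothetical values chosen for illustration, not survey data.

```python
def beta_posterior_summary(a, b, successes, n):
    """Posterior mean and variance when a Beta(a, b) prior is combined
    with `successes` out of `n` Bernoulli trials (posterior is beta)."""
    a_post, b_post = a + successes, b + n - successes
    total = a_post + b_post
    mean = a_post / total
    var = a_post * b_post / (total ** 2 * (total + 1))
    return mean, var

prior_a, prior_b = 2, 4          # prior mean 1/3, as in Example 5.25
sample_proportion = 0.52         # hypothetical, held fixed as n grows

gaps, variances = [], []
for n in (25, 100, 1000, 10000):
    successes = round(sample_proportion * n)
    mean, var = beta_posterior_summary(prior_a, prior_b, successes, n)
    gaps.append(abs(mean - successes / n))
    variances.append(var)
    print(f"n={n:>5}: posterior mean={mean:.4f}, "
          f"frequentist estimate={successes / n:.4f}, posterior var={var:.2e}")
```

With $n = 25$ the prior still pulls the posterior mean noticeably toward 1/3; by $n = 10{,}000$ the Bayesian and frequentist point estimates are nearly indistinguishable and the posterior variance is close to zero.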
In the examples of this section, prior distributions were chosen partially by
matching the mean of a distribution to someone’s a priori “best guess” about the
value of the parameter. We also mentioned at the beginning of the section that the
variability of the prior distribution often reflects the strength of that belief. In
practice, there is a third consideration for choosing a prior distribution: the ability
to apply Eq. (5.5) in a simple fashion. Ideally, we would like to choose a prior
distribution from a family (gamma, beta, etc.) such that the posterior distribution is
from that same family. When this happens we say that the prior distribution is
conjugate to the data distribution.

In Example 5.24, the prior $\pi(\lambda)$ is the Gamma(2, 1) pdf; we determined, using
Eq. (5.5), that the posterior distribution was Gamma(7, 1/3.85). It can be shown in
general (Exercise 104) that any gamma distribution is conjugate to an exponential
data distribution. Similarly, the prior and posterior distributions of $\theta$ in Example
5.25 were Beta(2, 4) and Beta(490, 455), respectively. Exercise 105 shows that any
beta distribution is conjugate to a binomial (or Bernoulli) data distribution. If the
data is normally distributed with known $\sigma$, then a normal prior for $\mu$ results in a
normal posterior. The case of unknown $\sigma$ is more complicated; see Section 14.4 of
the reference by Devore & Berk.
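
Conjugacy turns Eq. (5.5) into simple parameter bookkeeping. The sketch below writes the two updates from Examples 5.24 and 5.25 as small functions and confirms that they reproduce the posteriors found there. The function names are hypothetical, and the general update rules are stated here as working assumptions consistent with those two examples; deriving them is the subject of Exercises 104 and 105.

```python
def update_gamma_exponential(alpha, beta, data):
    """Gamma(alpha, beta) prior (beta is the scale parameter) combined with
    exponential data: the kernel is lam^(alpha + n - 1) * exp(-lam * (1/beta + sum x_i))."""
    new_alpha = alpha + len(data)
    new_beta = 1.0 / (1.0 / beta + sum(data))
    return new_alpha, new_beta

def update_beta_binomial(alpha, beta, successes, n):
    """Beta(alpha, beta) prior combined with `successes` out of n Bernoulli
    trials: the kernel is theta^(alpha + x - 1) * (1 - theta)^(beta + n - x - 1)."""
    return alpha + successes, beta + n - successes

# Example 5.24: Gamma(2, 1) prior and five inter-emission times
gamma_post = update_gamma_exponential(2, 1.0, [0.66, 0.48, 0.44, 0.71, 0.56])

# Example 5.25: Beta(2, 4) prior and 488 successes in 939 trials
beta_post = update_beta_binomial(2, 4, 488, 939)

print(gamma_post)  # shape 7 and scale close to 1/3.85
print(beta_post)   # (490, 455)
```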


5.6.4 Exercises: Section 5.6 (98-106) 


98. Nationwide, IQs have a normal distribution with mean 100 and standard
deviation 15. Let $X_1, \dots, X_n$ represent the IQs of a random sample of first
graders, which we assume also come from a normal distribution having $\sigma = 15$
but possibly a different mean $\mu$. Assign a $N(110, 7.5)$ prior distribution to $\mu$.
(a) Find the posterior distribution of $\mu$.

(b) Here are the actual IQ scores of a random sample of $n = 18$ first graders:


113 108 140 113 115 146 136 107 108
119 132 127 118 108 103 103 122 111


Calculate a point estimate of $\mu$ using the posterior distribution.

(c) Calculate and interpret a 95% credibility interval for $\mu$.

(d) Calculate a one-sample z 95% confidence interval for $\mu$ using the
18 observations with $\sigma = 15$, and compare with the credibility interval
of (c).

99. The number of customers arriving during a 1-h period at an ice cream shop is
modeled by a Poisson distribution with unknown parameter $\mu$. Based on past
experience, the owner believes that the average number of customers in 1 h is
about 15.

(a) Assign a prior to $\mu$ from the gamma family of distributions, such that the
mean of the prior is 15 and the standard deviation is 5 (reflecting moderate
uncertainty).

(b) The number of customers in ten randomly selected 1-h intervals is
recorded:


16 9 11 13 17 17 8 15 14 16


Find the posterior distribution of $\mu$.

(c) Find and interpret a 95% credibility interval for $\mu$.

100. In a study of 70 restaurant bills, 40 of the 70 were paid using cash. Let
$p$ denote the population proportion paying cash.
(a) Assuming a beta prior distribution for $p$ with $\alpha = 2$ and $\beta = 2$, obtain the
posterior distribution of $p$.

(b) Repeat (a) with $\alpha$ and $\beta$ positive and close to 0.

(c) Calculate a 95% credibility interval for $p$ using (b). Is your interval
compatible with $p = .5$?

(d) Calculate a 95% confidence interval for $p$ using Eq. (5.3), and compare
with the result of (c).

(e) Compare the interpretations of the credibility interval and the confidence
interval.

(f) Based on the prior in (b), test the hypothesis $p \le .5$ using the posterior
distribution to find $P(p \le .5)$.

101. For the data of Example 5.25 assume a Beta(2, 4) prior distribution and
assume that the 939 observations are a random sample from the Bernoulli
distribution. Use Eq. (5.5) to derive the posterior distribution, and compare
your answer with the result of Example 5.25.

102. Laplace’s rule of succession says that if there have been $n$ Bernoulli trials and
they have all been successes, then the probability of a success on the next trial
is $(n + 1)/(n + 2)$. For the derivation, Laplace used a Beta(1, 1) prior for the
parameter $p$.
(a) Show that, if a Beta(1, 1) prior is assigned to $p$ and there are $n$ successes in
$n$ trials, then the posterior mean of $p$ is $(n + 1)/(n + 2)$.

(b) Explain (a) in terms of total successes and failures; that is, explain the
result in terms of two prior trials plus $n$ later trials.

(c) Laplace applied his rule of succession to compute the probability that the
sun will rise tomorrow using 5000 years, or $n = 1{,}826{,}214$ days of history
in which the sun rose every day. Is Laplace’s method equivalent to
including two prior days when the sun rose once and failed to rise once?
Criticize the answer in terms of total successes and failures.

103. Suppose you have a random sample $X_1, X_2, \dots, X_n$ from the Poisson
distribution with mean $\mu$. If the prior distribution for $\mu$ has a gamma distribution with
parameters $\alpha$ and $\beta$, show that the posterior distribution is also gamma
distributed. What are its parameters?

104. Suppose you have a random sample $X_1, X_2, \dots, X_n$ from the exponential
distribution with parameter $\lambda$. If the prior distribution for $\lambda$ has a gamma
distribution with parameters $\alpha$ and $\beta$, show that the posterior distribution is
also gamma distributed. What are its parameters?

105. Suppose $X \sim \mathrm{Bin}(n, p)$, where the probability parameter $p$ is unknown. If the
prior distribution for $p$ has a beta distribution with parameters $\alpha$ and $\beta$, show
that the posterior distribution is also beta distributed. What are its parameters?

106. Consider a random sample $X_1, X_2, \dots, X_n$ from the normal distribution with
mean 0 and variance $\sigma^2 = 1/\tau$. (The parameter $\tau = 1/\sigma^2$ is called the precision
of the normal distribution.) Assume a gamma-distributed prior for $\tau$ and show
that the posterior distribution of $\tau$ is also gamma. What are its parameters?


Supplementary Exercises (107-138) 


At time $t = 0$, there is one individual alive in a certain population. A pure
birth process then unfolds as follows. The time until the first birth is
exponentially distributed with parameter $\lambda$. After the first birth, there are two
individuals alive. The time until the first gives birth again is exponential
with parameter $\lambda$, and similarly for the second individual. Therefore, the
time until the next birth is the minimum of two exponential ($\lambda$) variables,
which is exponential with parameter $2\lambda$. Similarly, once the second birth has
occurred, there are three individuals alive, so the time until the next birth is an
exponential rv with parameter $3\lambda$, and so on (the memoryless property of the
exponential distribution is being used here). Suppose the process is observed
until the sixth birth has occurred and the successive birth times are 25.2, 41.7,
51.2, 55.5, 59.5, 61.8 (from which you should calculate the times between
successive births). Derive the mle of $\lambda$. [Hint: The likelihood is a product of
exponential terms.]

When the sample standard deviation $S$ is based on a random sample from a
normal population distribution, it can be shown that

\[
E(S) = \sqrt{2/(n - 1)}\;\Gamma(n/2)\,\sigma \,/\, \Gamma[(n - 1)/2]
\]

Use this to obtain an unbiased estimator for $\sigma$ of the form $cS$. What is $c$ when
$n = 20$?

Each of $n$ specimens is to be weighed twice on the same scale. Let $X_i$ and $Y_i$
denote the two observed weights for the $i$th specimen. Suppose $X_i$ and $Y_i$ are
independent of each other, each normally distributed with mean value $\mu_i$ (the
true weight of specimen $i$) and variance $\sigma^2$.
(a) Show that the mle of $\sigma^2$ is $\hat{\sigma}^2 = \sum (X_i - Y_i)^2/(4n)$. [Hint: If
$\bar{z} = (z_1 + z_2)/2$, then $\sum (z_i - \bar{z})^2 = (z_1 - z_2)^2/2$.]

(b) Is the mle $\hat{\sigma}^2$ an unbiased estimator of $\sigma^2$? Find an unbiased estimator of
$\sigma^2$. [Hint: For any rv $Z$, $E(Z^2) = \mathrm{Var}(Z) + [E(Z)]^2$. Apply this to
$Z = X_i - Y_i$.]

The Principle of Unbiased Estimation has been criticized on the grounds that
in some situations the only unbiased estimator is patently ridiculous. Here is
one such example. Suppose that the number of major defects $X$ on a randomly
selected vehicle has a Poisson distribution with parameter $\mu$. You are going to
purchase two such vehicles and wish to estimate $\theta = P(X_1 = 0, X_2 = 0) = e^{-2\mu}$,
the probability that neither of these vehicles has any major defects. Your
estimate is based on observing the value of $X$ for a single vehicle. Denote this
estimator by $\hat{\theta} = g(X)$. Write the equation implied by the condition of
unbiasedness, $E[g(X)] = e^{-2\mu}$, cancel $e^{-\mu}$ from both sides, then expand what
remains on the right-hand side in an infinite series, and compare the two sides
to determine $g(X)$. If $X = 200$, what is the estimate? Does this seem
reasonable? What is the estimate if $X = 199$? Is this reasonable?

Let $X$, the payoff from playing a certain game, have pmf

\[
p(x; \theta) =
\begin{cases}
\theta & x = -1 \\
(1 - \theta)^2\,\theta^x & x = 0, 1, 2, \dots
\end{cases}
\]

(a) Verify that $p(x; \theta)$ is a legitimate pmf, and determine the expected payoff.
[Hint: Look back at the properties of a geometric random variable
discussed in Chap. 2.]

(b) Let $X_1, \dots, X_n$ be the payoffs from $n$ independent games of this type.
Determine the mle of $\theta$. [Hint: Let $Y$ denote the number of observations
among the $n$ that equal $-1$, and write the likelihood as a single expression
in terms of $\sum x_i$ and $y$.]

The reaction time (RT) to a stimulus is the interval of time commencing with
stimulus presentation and ending with the first discernible movement of a
certain type. The article “Relationship of Reaction Time and Movement Time
in a Gross Motor Skill” (Percept. Motor Skills, 1973: 453-454) reports that
the sample average RT for 16 experienced swimmers to a pistol start was
.214 s and the sample standard deviation was .036 s. Making any necessary
assumptions, derive a 90% CI for true average RT for all experienced
swimmers.

For each of 18 preserved cores from oil-wet carbonate reservoirs, the amount
of residual gas saturation after a solvent injection was measured at water
flood-out. Observations, in percentage of pore volume, were


23.5 31.5 34.0 46.7 45.6 32.5
41.4 37.2 42.5 46.9 51.5 36.4
44.5 35.7 33.5 39.3 22.0 51.2


(See “Relative Permeability Studies of Gas-Water Flow Following Solvent
Injection in Carbonate Rocks,” Soc. Petrol. Eng. J., 1976: 23-30.)
(a) Is it plausible that the sample was selected from a normal population
distribution?

(b) Calculate a 98% CI for the true average amount of residual gas saturation.

Aphid infestation of fruit trees can be controlled either by spraying with
pesticide or by inundation with ladybugs. In a particular area, four different
groves of fruit trees are selected for experimentation. The first three groves are
sprayed with pesticides 1, 2, and 3, respectively, and the fourth is treated with
ladybugs, with the following results on yield:


Treatment   $n_i$ (number of trees)   $\bar{x}_i$ (bushels/tree)   $s_i$
1           100                       10.5                         1.5
2           90                        10.0                         1.3
3           100                       10.1                         1.8
4           120                       10.7                         1.6


Let $\mu_i$ = the true average yield (bushels/tree) after receiving the $i$th
treatment. Then

\[
\theta = \frac{1}{3}(\mu_1 + \mu_2 + \mu_3) - \mu_4
\]

measures the difference in true average yields between treatment with
pesticides and treatment with ladybugs. When $n_1$, $n_2$, $n_3$, and $n_4$ are all
large, the estimator $\hat{\theta}$ obtained by replacing each $\mu_i$ by $\bar{X}_i$ is approximately
normal. Use this to derive a large-sample $100(1 - \alpha)\%$ CI for $\theta$, and compute
the 95% interval for the given data.

It is important that face masks used by firefighters be able to withstand high 
temperatures because firefighters commonly work in temperatures of 200- 
500 °F. In a test of one type of mask, 11 of 55 masks had lenses pop out at 
250°. Construct a 90% CI for the true proportion of masks of this type whose 
lenses would pop out at 250°. 

A journal article reports that a sample of size 5 was used as a basis for 
calculating a 95% CI for the true average natural frequency (Hz) of 
delaminated beams of a certain type. The resulting interval was (229.764, 
233.504). You decide that a confidence level of 99% is more appropriate than 
the 95% level used. What are the limits of the 99% interval? [Hint: Use the 
center of the interval and its width to determine $\bar{x}$ and $s$.]

Chronic exposure to asbestos fiber is a well-known health hazard. The article 
“The Acute Effects of Chrysotile Asbestos Exposure on Lung Function” 
(Envir. Res., 1978: 360-372) reports results of a study based on a sample of 
construction workers who had been exposed to asbestos over a prolonged 
period. Among the data given in the article were the following (ordered) 
values of pulmonary compliance (cm$^3$/cm H$_2$O) for each of 16 subjects
8 months after the exposure period (pulmonary compliance is a measure of 
lung elasticity, or how effectively the lungs are able to inhale and exhale): 


167.9 180.8 184.8 189.8 194.8 200.2 
201.9 206.9 207.2 208.4 226.3 22161 
228.5 232.4 239.8 258.6 


(a) Is it plausible that the population distribution is normal? 
(b) Compute a 95% CI for the true average pulmonary compliance after such 
exposure. 
A triathlon consisting of swimming, cycling, and running is one of the more
strenuous amateur sporting events. The article “Cardiovascular and Thermal 
Response of Triathlon Performance” (Medicine and Science in Sports and 
Exercise, 1988: 385-389) reports on a research study involving nine male 
triathletes. Maximum heart rate (beats/min) was recorded during perfor- 
mance of each of the three events. For swimming, the sample mean and 
sample standard deviation were 188.0 and 7.2, respectively. Assuming that 
the heart-rate distribution is (approximately) normal, construct a 98% CI for 
true mean heart rate of triathletes while swimming. 
An April 2009 survey of 2253 American adults conducted by the Pew 
Research Center’s Internet & American Life Project revealed that 1262 of 
the respondents had at some point used wireless means for online access. 
(a) Calculate and interpret a 95% CI for the proportion of all American 
adults who at the time of the survey had used wireless means for online 
access. 

(b) What sample size is required if the desired width of the 95% CI is to be at
most .04, irrespective of the sample results? [Hint: See Exercise 83.]

High concentration of the toxic element arsenic is all too common in
groundwater. The article “Evaluation of Treatment Systems for the Removal
of Arsenic from Groundwater” (Practice Periodical of Hazardous, Toxic,
and Radioactive Waste Mgmt., 2005: 152-157) reported that for a sample of
$n = 5$ water specimens selected for treatment by coagulation, the sample
mean arsenic concentration was 24.3 mg/L, and the sample standard
deviation was 4.1. The authors of the cited article used t-based methods to analyze
their data, so hopefully had reason to believe that the distribution of arsenic
concentration was normal.

(a) Calculate and interpret a 95% CI for true average arsenic concentration
in all such water specimens.

(b) Predict the arsenic concentration for a single water specimen in a way that
conveys information about precision and reliability. (See Exercise 49.)

Let $\theta_1$ and $\theta_2$ denote the mean weights for animals of two different species.
An investigator wishes to estimate the ratio $\theta_1/\theta_2$. Unfortunately the species
are extremely rare, so the estimate will be based on finding a single animal of
each species. Let $X_i$ denote the weight of the species $i$ animal ($i = 1, 2$),
assumed to be normally distributed with mean $\theta_i$ and standard deviation 1.

(a) What is the distribution of the variable $(\theta_2 X_1 - \theta_1 X_2)/\sqrt{\theta_1^2 + \theta_2^2}$? Show
that this variable depends on $\theta_1$ and $\theta_2$ only through $\theta_1/\theta_2$ (divide
numerator and denominator by $\theta_2$).

(b) Since the variable in (a) is normally distributed, we have

\[
P\left(-1.96 < (\theta_2 X_1 - \theta_1 X_2)/\sqrt{\theta_1^2 + \theta_2^2} < 1.96\right) = .95
\]

Now replace $<$ by $=$ and solve for $\theta_1/\theta_2$. Then show that a confidence
interval results if $x_1^2 + x_2^2 \ge 1.96^2$, whereas if this inequality is not
satisfied, the resulting confidence set is the complement of an interval.

Let $X_1, X_2, \dots, X_n$ be a random sample from a uniform distribution on the
interval $[0, \theta]$. Then if $Y = \max(X_i)$, the techniques of Sect. 4.9 show that $Y$
has density function

\[
f(y) =
\begin{cases}
\dfrac{n y^{n-1}}{\theta^n} & 0 \le y \le \theta \\
0 & \text{otherwise}
\end{cases}
\]

(a) Use $f(y)$ to verify that

\[
P\left[\theta \cdot (\alpha/2)^{1/n} \le Y \le \theta \cdot (1 - \alpha/2)^{1/n}\right] = 1 - \alpha
\]

and use this to derive a $100(1 - \alpha)\%$ CI for $\theta$.

(b) Verify that $P(\theta \cdot \alpha^{1/n} \le Y \le \theta) = 1 - \alpha$, and derive a $100(1 - \alpha)\%$ CI for $\theta$
based on this probability statement.

(c) Which of the two intervals derived in (a) and (b) is shorter? If your waiting
time for a morning bus is uniformly distributed and observed waiting
times are $x_1 = 4.2$, $x_2 = 3.5$, $x_3 = 1.7$, $x_4 = 1.2$, and $x_5 = 2.4$, obtain a 95%
CI for $\theta$ by using the shorter of the two intervals.

Consider 95% CIs for two different parameters $\theta_1$ and $\theta_2$, and let $A_i$ ($i = 1, 2$)
denote the event that the value of $\theta_i$ is included in the random interval that
results in the CI. Thus $P(A_i) = .95$.
(a) Suppose that the data on which the CI for $\theta_1$ is based is independent of the
data used to obtain the CI for $\theta_2$ (e.g., we might have $\theta_1 = \mu$, the
population mean height for American females, and $\theta_2 = p$, the proportion of all
Kodak digital cameras that don’t need warranty service). What can be said
about the simultaneous (i.e., joint) confidence level for the two intervals?
That is, how confident can we be that the first interval contains the value of
$\theta_1$ and that the second contains the value of $\theta_2$? [Hint: Consider
$P(A_1 \cap A_2)$.]

(b) Now suppose the data for the first CI is not independent of that for the
second one. What now can be said about the simultaneous confidence
level for both intervals? [Hint: Consider $P(A_1' \cup A_2')$, the probability that at
least one interval fails to include the value of what it is estimating. Now
use the fact that $P(A_1' \cup A_2') \le P(A_1') + P(A_2')$ [why?] to show that the
probability that both random intervals include what they are estimating is at
least .90. The generalization of the bound on $P(A_1' \cup A_2')$ to the
probability of a $k$-fold union is one version of the Bonferroni inequality.]

(c) What can be said about the simultaneous confidence level in (b) if the
confidence level for each interval separately is $100(1 - \alpha)\%$? What can be
said about the simultaneous confidence level if a $100(1 - \alpha)\%$ CI is
computed separately for each of $k$ parameters $\theta_1, \dots, \theta_k$?

Let \(X_1, \ldots, X_n\) be a random sample from a continuous probability distribution
having median \(\eta\) (so that \(P(X_i \le \eta) = P(X_i \ge \eta) = .5\)). Let \(Y_1\) and \(Y_n\) be the
smallest and largest order statistic for the sample (i.e., \(Y_1 = \min(X_i)\) and
\(Y_n = \max(X_i)\)).

(a) Show that

\[ P(Y_1 \le \eta \le Y_n) = 1 - \left(\frac{1}{2}\right)^{n-1} \]

so that \((Y_1, Y_n)\) is a \(100(1 - \alpha)\%\) confidence interval for \(\eta\) with
\(\alpha = (1/2)^{n-1}\). [Hint: Use the same arguments employed in Sect. 4.9 to
derive the cdfs of \(Y_1\) and \(Y_n\).]

(b) For each of six normal male infants, the amount of the amino acid alanine 
(mg/100 mL) was determined while the infants were on an isoleucine-free 
diet, resulting in the following data: 


2.84 3.54 2.80 1.44 2.94 2.70 


Compute a 97% CI for the true median amount of alanine for infants on 
such a diet (“The Essential Amino Acid Requirements of Infants,” Amer. 
J. of Nutrition, 1964: 322-330). 

(c) Let \(Y_2\) and \(Y_{n-1}\) denote the second-smallest and second-largest of the \(X_i\)s,
respectively. What is the confidence level of the interval \((Y_2, Y_{n-1})\) for \(\eta\)?

One method for straightening wire before coiling it to make a spring is called
“roller straightening.” The article “The Effect of Roller and Spinner Wire
Straightening on Coiling Performance and Wire Properties” (Springs, 1987:
27-28) reports on the tensile properties of wire. Suppose a sample of 16 wires
is selected and each is tested to determine tensile strength (N/mm²). The
resulting sample mean and standard deviation are 2160 and 30, respectively.

(a) The mean tensile strength for springs made using spinner straightening is
2150 N/mm². What hypotheses should be tested to determine whether the
mean tensile strength for the roller method exceeds 2150?

(b) Assuming that the tensile strength distribution is approximately normal, 
what test statistic would you use to test the hypotheses in part (a)? 

(c) What is the value of the test statistic for this data? 

(d) What is the P-value for the value of the test statistic computed in part (c)? 

(e) For a level .05 test, what conclusion would you reach? 

A new method for measuring phosphorus levels in soil is described in the
article “A Rapid Method to Determine Total Phosphorus in Soils” (Soil Sci.
Amer. J., 1988: 1301-1304). Suppose a sample of 11 soil specimens, each
with a true phosphorus content of 548 mg/kg, is analyzed using the new
method. The resulting sample mean and standard deviation for phosphorus
level are 587 and 10, respectively.

(a) Is there evidence that the mean phosphorus level reported by the new
method differs significantly from the true value of 548 mg/kg? Use
\(\alpha = .05\).

(b) What assumptions must you make for the test in part (a) to be appropriate? 

The article “Orchard Floor Management Utilizing Soil-Applied Coal Dust for
Frost Protection” (Agric. Forest Meteorol., 1988: 71-82) reports the following
values for soil heat flux of eight plots covered with coal dust.


34.7 35.4 34.7 37.7 32.5 28.0 18.4 24.9 


The mean soil heat flux for plots covered only with grass is 29.0. Assuming 
that the heat-flux distribution is approximately normal, does the data suggest 
that the coal dust is effective in increasing the mean heat flux over that for 
grass? Test the appropriate hypotheses using \(\alpha = .05\).

The article “Caffeine Knowledge, Attitudes, and Consumption in Adult
Women” (J. Nutrit. Ed., 1992: 179-184) reports the following summary
data on daily caffeine consumption for a sample of adult women: n = 47,
\(\bar{x} = 215\) mg, s = 235 mg, and range = 5-1176.

(a) Does it appear plausible that the population distribution of daily caffeine 
consumption is normal? Is it necessary to assume a normal population 
distribution to test hypotheses about the value of the population mean 
consumption? Explain your reasoning. 

(b) Suppose it had previously been believed that mean consumption was at 
most 200 mg. Does the given data contradict this prior belief? Test the 
appropriate hypotheses at significance level .10. 

The accompanying observations on residual flame time (seconds) for strips of
treated children’s nightwear were given in the article “An Introduction to
Some Precision and Accuracy of Measurement Problems” (J. Test. Eval.,
1982: 132-140). Suppose a true average flame time of at most 9.75 had
been mandated. Does the data suggest that this condition has not been met?
Carry out an appropriate test after first investigating the plausibility of
assumptions that underlie your method of inference.


9.85 9.93 9.75 9.77 9.67 9.87 9.67 
9.94 9.85 9.75 9.83 9.92 9.74 9.99 
9.88 9.95 9.95 9.93 9.92 9.89 


The incidence of a certain type of chromosome defect in the US adult male 
population is believed to be 1 in 75. A random sample of 800 individuals in 
US penal institutions reveals 16 who have such defects. Can it be concluded 
that the incidence rate of this defect among prisoners differs from the pre- 
sumed rate for the entire adult male population? 

(a) State and test the relevant hypotheses using \(\alpha = .05\).
(b) What type of error might you have made in reaching a conclusion?

In an investigation of the toxin produced by a certain poisonous snake, a 
researcher prepared 26 different vials, each containing 1 g of the toxin, and
then determined the amount of antitoxin needed to neutralize the toxin. The 
sample average amount of antitoxin necessary was found to be 1.89 mg, and 
the sample standard deviation was .42. Previous research had indicated that 
the true average neutralizing amount was 1.75 mg/g of toxin. Does the new 
data contradict the value suggested by prior research? Test the relevant 
hypotheses. Does the validity of your analysis depend on any assumptions 
about the population distribution of neutralizing amount? Explain. 

The sample average unrestrained compressive strength for 45 specimens of a 
particular type of brick was computed to be 3107 psi, and the sample standard 
deviation was 188. The distribution of unrestrained compressive strength may 
be somewhat skewed. Does the data strongly indicate that the true average 
unrestrained compressive strength is less than the design value of 3200? Test 
using \(\alpha = .001\).

To test the ability of auto mechanics to identify simple engine problems, an 
automobile with a single such problem was taken in turn to 72 different car 
repair facilities. Only 42 of the 72 mechanics who worked on the car correctly 
identified the problem. Does this strongly indicate that the true proportion of 
mechanics who could identify this problem is less than .75? Test the appro- 
priate hypotheses. 

On December 30, 2009, the New York Times reported that in a survey of
948 American adults who said they were at least somewhat interested in 
college football, 597 said the Bowl Championship System should be replaced 
by a playoff similar to that used in college basketball (in fact, a playoff system 
replaced the BCS starting with the 2014 season). Does this provide compel- 
ling evidence for concluding that a majority of all such individuals favored 
replacing the BCS with a playoff at that time? Test the appropriate 
hypotheses. 

An article in the November 11, 2005, issue of the San Luis Obispo Tribune 
reported that researchers making random purchases at California Wal-Mart 
stores found scanners coming up with the wrong price 8.3% of the time. 
Suppose this was based on 200 purchases. The National Institute for Standards 
and Technology says that in the long run at most two out of every 100 items 
should have incorrectly scanned prices. Carry out an appropriate hypothesis 
test to decide whether the NIST benchmark is not satisfied. [Caution: Are the 
conditions for a one-proportion z test met? If not, what distribution can be 
used instead? ] 

Annual holdings turnover for a mutual fund is the percentage of a fund’s 
assets that are sold during a particular year. Generally speaking, a fund with a 
low value of turnover is more stable and risk averse, whereas a high value of 
turnover indicates a substantial amount of buying and selling in an attempt to 
take advantage of short-term market fluctuations. Here are values of turnover 
for a sample of 20 large-cap blended funds extracted from Morningstar.com: 
1.03 1.23 1.10 1.64 1.30 1.27 1.25 0.78 1.05 0.64
0.94 2.86 1.05 0.75 0.09 0.79 1.61 1.26 0.93 0.84 


(a) Would you use the one-sample t test to decide whether there is compelling
evidence for concluding that the population mean turnover is less than 
100%? Explain. 

(b) A normal probability plot of the 20 ln(turnover) values shows a very
pronounced linear pattern, suggesting it is reasonable to assume that the
turnover distribution is lognormal. Recall that X has a lognormal distribution
if ln(X) is normally distributed with mean value \(\mu\) and standard
deviation \(\sigma\). Because \(\mu\) is also the median of the ln(X) distribution, \(e^{\mu}\) is
the median of the X distribution. Use this information to decide whether
there is compelling evidence for concluding that the median of the
turnover population distribution is less than 100%.

When \(X_1, X_2, \ldots, X_n\) are independent Poisson variables, each with parameter
\(\mu\), and n is large, the sample mean \(\bar{X}\) has approximately a normal distribution
with \(E(\bar{X}) = \mu\) and \(\mathrm{Var}(\bar{X}) = \mu/n\). This implies that

\[ Z = \frac{\bar{X} - \mu}{\sqrt{\mu/n}} \]

has approximately a standard normal distribution. For testing \(H_0\): \(\mu = \mu_0\), we
can replace \(\mu\) by \(\mu_0\) in the equation for Z to obtain a test statistic. This statistic
is actually preferred to the large-sample statistic with denominator \(S/\sqrt{n}\)
(when the \(X_i\)s are Poisson) because it is tailored explicitly to the Poisson
assumption. If the number of requests for consulting received by a certain
statistician during a 5-day work week has a Poisson distribution and the total
number of consulting requests during a 36-week period is 160, does this
suggest that the true average number of weekly requests exceeds 4.0? Test
using \(\alpha = .02\).

When the population distribution is normal and n is large, the sample standard
deviation S has approximately a normal distribution with \(E(S) \approx \sigma\) and
\(\mathrm{Var}(S) \approx \sigma^2/(2n)\). We already know that in this case, for any n, \(\bar{X}\) is normal
with \(E(\bar{X}) = \mu\) and \(\mathrm{Var}(\bar{X}) = \sigma^2/n\).
(a) Assuming that the underlying distribution is normal, what is an approximately
unbiased estimator of the 99th percentile \(\theta = \mu + 2.33\sigma\)?
(b) When the \(X_i\)s are normal, it can be shown that \(\bar{X}\) and S are independent rvs
(one measures location whereas the other measures spread). Use this to
compute \(\mathrm{Var}(\hat{\theta})\) and \(\sigma_{\hat{\theta}}\) for the estimator \(\hat{\theta}\) of part (a). What is the
estimated standard error \(\hat{\sigma}_{\hat{\theta}}\)?
(c) Write a test statistic formula for testing \(H_0\): \(\theta = \theta_0\) that has approximately
a standard normal distribution when \(H_0\) is true. If soil pH is normally
distributed in a certain region and 64 soil samples yield \(\bar{x} = 6.33\), s = .16,
does this provide strong evidence for concluding that at most 99% of all
possible samples would have a pH of less than 6.75? Test using \(\alpha = .01\).


Markov Chains 


This chapter explores the properties of a broadly applicable probability model 
called a Markov chain, named after Russian mathematician A. A. Markov (1856-1922).
Markov observed that many real-world phenomena can be modeled as a
sequence of “transitions” from one “state” to another, with each transition having 
some associated uncertainty. For example, a taxi driver might “transition” between 
several towns (or zones within a large city); each time he drops off a passenger, he 
can’t be certain where his next fare will want to go. Similarly, a gambler might 
think of her winnings as transitioning from one “state” (really, a dollar amount) to
another; with each round of the game she plays, she cannot be certain whether that 
dollar amount will go up or down (though, obviously, she hopes it goes up!). The 
same could be said for modeling the daily closing prices of a stock: each new day, 
there is uncertainty about whether that stock will “transition” to a higher or lower 
value, and this uncertainty could be modeled using the tools of probability. 

In all of these examples, aside from the probability model for how transitions 
occur, one extra piece of information is critical: the current “state” (where the taxi 
driver is, how much money the gambler has). After all, if the gambler is making $5 
wagers, how much money she might have after the next game depends on how 
much she has now—if she currently holds $45 in chips, then at the end of the 
upcoming round she can only have $40 or $50 on an even bet. The model structure 
proposed by Markov applies to situations where only knowledge of the current 
state, and the nature of transitions, is necessary—we don’t care how our gambler 
arrived at $45 in chips, only that that’s how much she currently possesses. 

Section 6.1 introduces basic notation for Markov chains and provides a rigorous 
definition of the property alluded to in the previous paragraph. In Sects. 6.2 and 6.3 
we explain how the use of matrix notation can facilitate Markov chain 
computations. Section 6.4 focuses on a special class of Markov chains, so-called 
regular chains, which have a rather exceptional property embodied in the Steady- 
State Theorem. Section 6.5 considers a different class of Markov chains, those with 
one or more “inescapable” states, such as a gambler going broke. Finally, Sect. 6.6 
discusses the simulation of Markov chains using software. 
6.1 Terminology and Basic Properties


Markov chains provide a model for sequential information that allows future 
outcomes to depend on previous ones, albeit in a very specific way (the defining 
Markov property). Researchers in numerous fields employ Markov chains to model 
the phenomena they study. Recent examples include predicting changes in electric- 
ity demand; modeling the motion of sperm whales off the Galapagos Islands; 
Chinese citizens changing their cell phone service; keeping track of inpatient bed 
usage at hospitals; monitoring patterns in Web browser histories to deploy better- 
targeted advertising; the evolution of drought conditions over time; and the dynam- 
ics of capital assets. 

This first section introduces the basic vocabulary and notation of Markov chains. 
We begin with the following classic (if slightly artificial) example, which will serve 
as a thread throughout the chapter. 


Example 6.1 A city has three different taxi zones, numbered 1, 2, and 3. A taxi 
driver operates his cab in all three zones. The probability that his next passenger has 
a destination in a particular one of these zones depends on where the passenger is 
picked up. Specifically, whenever the taxi driver is in zone 1, the probability his 
next passenger is going to zone 1 is .3, to zone 2 is .2, and to zone 3 is .5. Starting in
zone 2, the probability his next passenger is going to zone 1 is .1, to zone 2 is .8, and
to zone 3 is .1. Finally, whenever he is in zone 3, the probability his next passenger
is going to zone 1 is .4, to zone 2 is .4, and to zone 3 is .2. These probabilities are
encapsulated in the state diagram in Fig. 6.1. 

In every such state diagram, the sum of the probabilities on branches exiting any 
state must equal 1. For example, in Fig. 6.1 the probabilities exiting state 2 (i.e.,
zone 2) are .1, .8, and .1. We include in this calculation the probability .8 indicated 
by a “loop” in the state diagram, which simply means that the taxi driver has .8 
probability of staying in zone 2 once he has dropped off a fare in zone 2. 
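
The exit-probability rule can be checked mechanically. The following Python sketch stores the branch probabilities of Example 6.1 in a nested dictionary and confirms that the branches leaving each zone total 1. The names exits and exit_totals are illustrative choices, not notation from the text.

```python
# Illustrative encoding of the state diagram: exits[i][j] is the probability
# that a fare picked up in zone i is dropped off in zone j (Example 6.1 values).
exits = {
    1: {1: 0.3, 2: 0.2, 3: 0.5},
    2: {1: 0.1, 2: 0.8, 3: 0.1},
    3: {1: 0.4, 2: 0.4, 3: 0.2},
}


def exit_totals(diagram):
    """Return the total probability on the branches leaving each state."""
    return {state: sum(branches.values()) for state, branches in diagram.items()}


totals = exit_totals(exits)
for state, total in totals.items():
    # The loop at zone 2 (probability .8) is counted like any other branch.
    print(f"zone {state}: total exit probability = {total:.1f}")
```

Because the values are floating-point numbers, any automated check of the totals should allow a small tolerance rather than test for exact equality with 1.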

Define \(X_0\) to be the zone in which the taxi driver starts and \(X_n\) (\(n \ge 1\)) to be the
zone where he drops off his nth fare. Since \(X_0, X_1, X_2, \ldots\) “occur” in sequence, they
are often referred to as a chain. More precisely, this particular sequence is a finite-state,
discrete-time, time-homogeneous Markov chain. Each of these terms will be
explained shortly. ■

Fig. 6.1 State diagram for Example 6.1


In Example 6.1, each of the \(X_n\) for \(n \ge 0\) assumes the value 1, 2, or 3 according to
the destination zone. The zones collectively constitute the states of our chain, and 
so the state space is {zone 1, zone 2, zone 3}, although we will often drop the state 
names and just use the integers {1, 2, 3}. States can be identified with physical 
locations, levels (such as high/medium/low), dollar amounts, or just about anything 
else. We’ll sometimes refer to the \(X_n\) as random variables, even though they are not
necessarily numerical (which goes against the definition from Chap. 2). The
random variable \(X_0\) is called the initial state of the chain. A discrete-space chain
is one for which the number of possible states is finite or countably infinite. If there 
are finitely many possible states, we have a finite-state chain. 

Since time was indexed by the discrete listing n=0, 1, 2, ..., the sequence of 
zones the taxi driver visited in Example 6.1 is called a discrete-time chain. 
Section 7.7 gives an overview of continuous-time chains, often indexed as \(\{X_t : t \in [0, \infty)\}\),
which are useful for modeling behavior continuously in time rather
than just at discrete time points (e.g., tracking over time the number of people 
looking at a particular Web site). The taxi driver chain is also time-homogeneous, 
in that the specified probabilities do not change over time. One could imagine a 
different, more complicated model where the probabilities specified in Example 6.1 
apply during morning hours but not in the evening, so that the probability of taking 
a fare from zone 1 to zone 3 is .5 for n = 1 (beginning of the work day) but is .1, say,
for n= 20 (end of his shift). See Exercises 78 and 79 for examples of nonhomoge- 
neous Markov chains. 


Example 6.2 This is a simple version of the famous Gambler’s Ruin problem, 
which we previously considered in Exercise 145 of Chap. 1. Allan and Beth play a 
succession of independent games for $1 each. Suppose Allan starts with $2 and 
Beth with $1, and the chance of Allan winning $1 is p on each game. Ties are not 
allowed, so the chance of Beth winning $1 on any particular game is 1 — p. They 
compete until one of the two players goes broke (has $0). 

For n = 0, 1, 2, ..., define \(X_n\) = the amount of money Allan has after n games.
The initial state has been specified as \(X_0\) = $2; Allan’s successive holdings \(X_0, X_1, X_2, \ldots\)
form our chain. The state space for \(X_n\) is {$0, $1, $2, $3} or just {0, 1, 2, 3},
so we again have a finite-state chain. The state space and the specified probabilities
are illustrated by the state diagram in Fig. 6.2. Notice we have included two “loops”
with probability 1 at $0 and $3; these reflect the constraint that the game stops
once Allan reaches one of these dollar amounts. That is, once Allan is “at” $3, he
will stay at $3, and the same goes for $0. Also, it will always be understood that if
no arrow points from state i to state j in such a diagram, then the probability of
moving from state i immediately into state j (i.e., in one time step) is zero. ■

Fig. 6.2 State diagram for Example 6.2
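
To get a feel for how such a chain evolves, one can simulate a single play of the Gambler's Ruin game. The sketch below is only an illustration under stated assumptions: the function name play_gamblers_ruin, the choice p = 0.5, and the seed are all arbitrary, and Python's standard random module stands in for whatever software the reader prefers.

```python
import random


def play_gamblers_ruin(p, start=2, total=3, rng=None):
    """Simulate Allan's holdings until he has $0 or $total; return the path."""
    if rng is None:
        rng = random.Random()
    path = [start]
    while 0 < path[-1] < total:
        # Allan wins $1 with probability p; otherwise Beth wins $1.
        step = 1 if rng.random() < p else -1
        path.append(path[-1] + step)
    return path


rng = random.Random(2024)  # fixed seed so that the run is repeatable
path = play_gamblers_ruin(p=0.5, rng=rng)
print(path)  # Allan's holdings X_0, X_1, ... until someone goes broke
```

Every simulated path starts at 2, changes by exactly 1 at each step, and ends the first time it reaches 0 or 3, mirroring the arrows in the state diagram.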
Example 6.3 A random walk. Imagine a marker initially placed at 0 on the number
line. A fair coin is flipped repeatedly; each head moves the marker one integer to the
right, while each tail moves it one integer to the left. Let \(X_0 = 0\), the initial state, and
\(X_n\) = the marker’s position after n coin flips for \(n \ge 1\). Each member of the chain can
only take on a finite set of values: \(X_1\) is either +1 or -1, \(X_2\) is one of -2, 0, or 2, and
so on. However, the collection of all possible states across all time indices
comprises the entire set of integers: {..., -3, -2, -1, 0, 1, 2, 3, ...}. Thus, this
so-called “random walk” is an infinite-state (though still discrete-state) chain; it is
partially illustrated in Fig. 6.3. ■

Fig. 6.3 State diagram for Example 6.3
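
The claim that each individual position has only finitely many possible values can be verified by direct enumeration. The following Python sketch (the helper name walk_distribution is an illustrative choice) builds the exact distribution of the marker's position after n fair coin flips, using exact fractions to avoid rounding.

```python
from fractions import Fraction


def walk_distribution(n):
    """Exact distribution of the marker's position after n fair coin flips."""
    dist = {0: Fraction(1)}  # the marker starts at 0 with certainty
    for _ in range(n):
        nxt = {}
        for pos, prob in dist.items():
            for step in (-1, 1):  # tail: one step left; head: one step right
                nxt[pos + step] = nxt.get(pos + step, Fraction(0)) + prob / 2
        dist = nxt
    return dist


for n in (1, 2, 3):
    # Prints the sorted list of positions reachable after n flips.
    print(n, sorted(walk_distribution(n)))
```

For n = 1 and n = 2 the reachable positions are [-1, 1] and [-2, 0, 2], matching the example; the set keeps growing with n, which is why the full state space is all of the integers.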


6.1.1 The Markov Property 


All of the preceding examples have an important feature known as the Markov 
property. Loosely speaking, it says that in order to know where the chain will go
next (say, \(X_{n+1}\)), it suffices to know where the chain is now (the value of \(X_n\)). In
particular, once the current state is specified, the path that brought the chain to that
state is irrelevant. Consider, for example, the random walk of Example 6.3: if for
any particular n we have \(X_n = 4\), then we know \(X_{n+1} = 3\) or 5 with probability .5
each. It does not matter whether the chain arrived at 4 quickly (0 → 1 → 2 → 3 → 4)
or by a more circuitous route; the probability distribution of the next state in the 
chain is the same. This notion is formalized in the following definition. 


DEFINITION 

Let \(X_0, X_1, X_2, \ldots\) be a sequence of random variables (a chain) on some discrete
state space. The sequence is said to have the Markov property if, for any time
index n and any set of (not necessarily distinct) states \(s_0, s_1, \ldots, s_n, s_{n+1}\),

\[ P(X_{n+1} = s_{n+1} \mid X_0 = s_0, X_1 = s_1, \ldots, X_n = s_n) = P(X_{n+1} = s_{n+1} \mid X_n = s_n) \qquad (6.1) \]

Such a sequence \(\{X_n : n = 0, 1, 2, \ldots\}\) is called a Markov chain.
The conditional probabilities specified in Expression (6.1) are called the one-step
transition probabilities of the chain, or sometimes just transition
probabilities. These are precisely the probabilities specified in Examples 6.1-6.3.
It’s critical to recognize that these are conditional probabilities: they specify the
likelihood of the next member of the chain \(X_{n+1}\) being in any particular state, given
the current state of the chain \(X_n\).
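
For the random walk of Example 6.3, the Markov property can be checked numerically by brute force. The sketch below enumerates all equally likely sequences of seven coin flips and computes the conditional probability that the walk moves from 4 to 5 on the seventh flip, given two different histories that both end at 4 after six flips. The function names and the two particular histories are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction
from itertools import product


def positions(flips):
    """Positions of the marker, starting at 0, for a tuple of +1/-1 steps."""
    path = [0]
    for step in flips:
        path.append(path[-1] + step)
    return path


def prob_next_is_5(history):
    """P(X_7 = 5 | X_0, ..., X_6 = history), by counting flip sequences."""
    matching = [positions(f) for f in product((-1, 1), repeat=7)
                if positions(f)[:7] == list(history)]
    hits = sum(1 for path in matching if path[7] == 5)
    return Fraction(hits, len(matching))


overshoot = (0, 1, 2, 3, 4, 5, 4)   # climbs to 5, then steps back to 4
winding = (0, -1, 0, 1, 2, 3, 4)    # dips to -1 before climbing to 4
print(prob_next_is_5(overshoot), prob_next_is_5(winding))
```

Both conditional probabilities come out to 1/2: once the current position is known to be 4, the route taken to get there carries no extra information about the next step.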


Example 6.4 (Example 6.1 continued) The sequence of successive zones visited
by our taxi driver is characterized by nine one-step transition probabilities. For
example, it is stated that the driver “transitions” from zone 1 to zone 3 with
probability .5, which means that for any time index n,

\[ P(X_{n+1} = 3 \mid X_n = 1) = .5 \]

This probability does not depend on the value of n, because the chain is time-homogeneous.
Instead of writing \(P(X_{n+1} = 3 \mid X_n = 1) = .5\), we will sometimes
abbreviate with P(1 → 3) = .5 to emphasize the idea of transitioning from one
state to another. Thus, the complete set of one-step transition probabilities for the
taxi driver is

P(1 → 1) = .3   P(1 → 2) = .2   P(1 → 3) = .5
P(2 → 1) = .1   P(2 → 2) = .8   P(2 → 3) = .1
P(3 → 1) = .4   P(3 → 2) = .4   P(3 → 3) = .2   ■


Example 6.5 (Example 6.2 continued) The changing fortunes of Allan are
governed by six (non-zero) transition probabilities:

P(1 → 0) = 1 - p   P(1 → 2) = p   P(2 → 1) = 1 - p   P(2 → 3) = p
P(0 → 0) = 1   P(3 → 3) = 1

The last two probabilities above correspond to termination of the sequence of
games. From a mathematical (if not practical) perspective, they communicate the idea
that the chain marches on even when gameplay has ended (e.g., 2 → 3 → 3 → 3 ...).
That is, the conditional probability \(P(3 \to 3) = P(X_{n+1} = 3 \mid X_n = 3) = 1\) indicates that if
Allan has all $3 at stake after n games, he will retain his $3 while some imaginary
future gameplay continues (the (n + 1)st game, the (n + 2)nd game, etc.). This convention
eliminates the need to “stop” the Markov chain at some particular time point n.
We’ll elaborate much more on this in Sect. 6.5.

In addition, there are ten one-step transition probabilities that equal zero; for
example, according to the rules of Gambler’s Ruin, P(1 → 3) = 0, and P(3 → x) = 0
for \(x \in \{0, 1, 2\}\). In general, a finite-state Markov chain with s states is specified by \(s^2\)
one-step transition probabilities, although it is quite common for many (if not most) of
these to be zero. ■
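
The bookkeeping in Example 6.5 (sixteen transition probabilities, six of them non-zero) can be confirmed with a few lines of code. In this sketch the function name ruin_transitions and the value p = 0.6 are arbitrary illustrative choices; any p strictly between 0 and 1 gives the same counts.

```python
def ruin_transitions(p, total=3):
    """One-step transition probabilities P(i -> j) for states 0..total."""
    states = range(total + 1)
    probs = {(i, j): 0.0 for i in states for j in states}
    probs[(0, 0)] = 1.0            # Allan is broke: the chain stays at $0
    probs[(total, total)] = 1.0    # Allan holds all the money: stays at $total
    for i in range(1, total):
        probs[(i, i + 1)] = p      # Allan wins the next game
        probs[(i, i - 1)] = 1 - p  # Beth wins the next game
    return probs


probs = ruin_transitions(p=0.6)    # p = 0.6 is an arbitrary illustrative value
nonzero = sum(1 for v in probs.values() if v > 0)
print(len(probs), nonzero, len(probs) - nonzero)  # prints: 16 6 10
```

Storing every pair (i, j), including the zero entries, makes the count of 4 × 4 = 16 probabilities explicit; a sparser representation would keep only the six non-zero entries.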
Example 6.6 Markov chains are often used to model changing weather conditions;
research literature in both meteorology and climate science is rife with Markov 
chain applications. The article “To Ski or Not to Ski: Estimating Transition 
Matrices to Predict Tomorrow’s Snowfall Using Real Data” (J. of Statistics 
Educ., vol. 18, no. 3, 2010) provides data for several US cities on the daily 
transitions between “snow days,” defined by a snow depth of at least 50 mm, and 
“green days” (snow depth < 50 mm). Let \(X_n\) represent the snow status, either S for
snow or G for green, on the nth recorded day. For New York City, the following
one-step transition probabilities are provided:


P(G → G) = .964   P(G → S) = .036   P(S → G) = .224   P(S → S) = .776


If today is a “green day” in New York, then there is a 96.4% chance that 
tomorrow’s snow depth will also be below 50 mm, based on the available weather 
data (which, incidentally, stretches back to the year 1912 for New York). On the 
other hand, as the author notes, “the presence of a significant snow depth (accumu- 
lation) on the current day in Central Park (New York) has an approximately 1 in 
5 chance of melting before the next day.” ■
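
One natural way to obtain transition probabilities like these from a daily record is to use relative frequencies: divide the number of observed transitions from state a to state b by the number of transitions out of state a. The sketch below illustrates that idea on a short made-up record; the data string and the function name estimate_transitions are hypothetical, and the cited article's exact procedure is not reproduced here.

```python
from collections import Counter


def estimate_transitions(days):
    """Estimate P(a -> b) as (# of a-to-b transitions) / (# of transitions out of a)."""
    pair_counts = Counter(zip(days, days[1:]))
    from_counts = Counter(days[:-1])
    return {(a, b): count / from_counts[a]
            for (a, b), count in pair_counts.items()}


record = "GGGGSSSGGGGGGSSGGGGG"  # made-up data, NOT the New York record
estimates = estimate_transitions(record)
for (a, b), prob in sorted(estimates.items()):
    print(f"P({a} -> {b}) = {prob:.3f}")
```

In this toy record there are five transitions out of a snow day, two of which lead to a green day, so the estimate of P(S → G) is 2/5 = .4. Real estimates such as those for New York rest on many decades of daily observations.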


Not all sequences of random variables possess the Markov property. In econometrics
(statistical methodology applied to economic scenarios), for example, most
models for the closing price \(X_{n+1}\) of a stock on the (n + 1)st day of trading incorporate
not only the previous day’s closing price \(X_n\) but also information from many
previous days (the data \(X_{n-1}\), \(X_{n-2}\), and so on). The likelihood that \(X_{n+1}\) will be $5
higher than \(X_n\) may depend on the stock’s behavior over all of last week, not just
where it closed on day n.

That said, in some instances a model that includes more than a one-time-step 
dependence can be modified by reconfiguring the state space in such a way that it 
satisfies the Markov property. This expansion of states is illustrated in the next 
example. 


Example 6.7 The weather model presented in Example 6.6 satisfies the Markov 
property; in particular, it assumes that one can model tomorrow’s weather based on 
today’s conditions without incorporating any previous information. A more realistic
model might assume that tomorrow’s snow depth depends on today’s and
yesterday’s weather. Suppose, for example, that tomorrow will be a snow day
with probability .8 if both yesterday and today were snow days; with probability
.6 if today was a snow day but yesterday was a green day; with probability .3 if it
was green today and snowy yesterday; and with probability .1 if both previous days
were green.

Once again let \(X_n\) = the “state” of the weather on day n: G for green day, S for
snow day. Then the sequence \(X_0, X_1, X_2, \ldots\) of weather states does not satisfy the
Markov property, because the conditional distribution of \(X_{n+1}\) given all previous
weather information depends on both \(X_n\) and \(X_{n-1}\) (the previous two days’ weather
conditions). Let’s make the following modification: define \(Y_n\) to be the ordered pair

\(Y_n\) = (day n weather, day n + 1 weather) = \((X_n, X_{n+1})\)

So, for example, if snow depth was ≥ 50 mm on day 4 but < 50 mm on day
5, then \(Y_4\) = (S, G). The weather on day 6 depends on these previous 2 days, but they
are now both contained in a single “variable,” \(Y_4\). In other words, \(Y_5\) can be modeled
entirely by knowing \(Y_4\): the first entry of \(Y_5\), namely \(X_5\), matches the second entry of \(Y_4\), and the
probability distribution of the second entry of \(Y_5\) (i.e., \(X_6\)) is determined by the rules
given at the beginning of this example.

With this modification, the sequence \(Y_0, Y_1, Y_2, \ldots\) forms a Markov chain. The
state space of this chain is not {S, G}, but rather {(G, G), (G, S), (S, G), (S, S)}. The
earlier weather rules can be expressed as one-step transition probabilities for this
chain:

P((S, S) → (S, S)) = .8   P((S, G) → (G, S)) = .3
P((G, S) → (S, S)) = .6   P((G, G) → (G, S)) = .1

Four other transition probabilities can be found by considering the complements
of the given transition events. The final eight transition probabilities (with
four states, there are \(4^2 = 16\) total one-step transition probabilities) are all
0, e.g., P((S, G) → (S, S)) = 0, because if \(Y_n\) = (S, G) then it was “green” on day
n + 1 (\(X_{n+1}\) = G), meaning the first entry of \(Y_{n+1}\) must also be G. ■
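
The state-expansion construction of Example 6.7 can be carried out programmatically. The following sketch takes the four "snow tomorrow" rules and produces all sixteen one-step transition probabilities for the chain of ordered pairs; the names snow_prob and pair_chain are illustrative choices rather than notation from the text.

```python
# P(snow tomorrow | (earlier day, later day)), from the rules of Example 6.7.
snow_prob = {("S", "S"): 0.8, ("G", "S"): 0.6, ("S", "G"): 0.3, ("G", "G"): 0.1}


def pair_chain(snow_prob):
    """Build P(Y_n -> Y_{n+1}) for the pair states Y_n = (X_n, X_{n+1})."""
    states = sorted(snow_prob)
    probs = {(a, b): 0.0 for a in states for b in states}
    for (earlier, later), p_snow in snow_prob.items():
        # The next pair must begin with the later day's weather.
        probs[((earlier, later), (later, "S"))] = p_snow
        probs[((earlier, later), (later, "G"))] = 1 - p_snow
    return probs


probs = pair_chain(snow_prob)
zeros = sum(1 for v in probs.values() if v == 0.0)
print(len(probs), zeros)  # prints: 16 8
```

Each pair state has exactly two possible successors (the complement pairs mentioned in the example), which accounts for the eight non-zero entries and leaves the remaining eight equal to zero.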


The remainder of this chapter will focus almost exclusively on finite-state, 
discrete-time, time-homogeneous chains; these are the most commonly encoun- 
tered models in practice. The case of infinite-state chains, including the random 
walk of Example 6.3, is considered in several more advanced texts; see, for 
example, the book Introduction to Probability Models by Ross listed in the
references. 


6.1.2 Exercises: Section 6.1 (1-10) 


1. The article “Markov Chain Models of Negotiators’ Communication” (Encyclo- 
pedia of Peace Psychology 2012: 608-612) describes the following set-up for the 
back and forth dialogue between two negotiators. If at any stage a negotiator 
engages in a cooperative strategy, the other negotiator will respond with a 
cooperative strategy with probability .6. Otherwise, the response is described 
as a competitive strategy. Similarly, there is probability .7 that a competitive 
strategy offered at any stage of the negotiations will be met by another competitive
strategy. Let \(X_n\) = the strategy employed at the nth stage of the negotiation.
Identify the state space for the chain, specify its one-step transition probabilities, 
and draw the corresponding state diagram. 

2. Imagine m balls being exchanged between two adjacent chambers (left and right) 
according to the following rules. At each time step, one of the m balls is 
randomly selected and moved to the opposite chamber, i.e., if the selected ball is
currently in the right chamber, it will be moved to the left one, and vice versa.
Let \(X_n\) = the number of balls in the left chamber after the nth exchange. (This is
called an Ehrenfest chain, a model often used to describe the movement of gas
molecules.)

(a) Identify the state space of this chain. 

(b) Suppose m= 3. Specify the one-step transition probabilities for this chain. 
[Hint: It might be helpful to draw the two chambers and the possible 
positions of the three balls.] 

(c) Draw the state diagram corresponding to (b). 

(d) Generalize the probabilities in (b) to the case of m balls. 

3. A certain machine used in a manufacturing process can be in one of three states: 
fully operational (“full”), partially operational (“part”), or broken. If the 
machine is fully operational today, there’s a .7 probability it will be fully 
operational again tomorrow, a .2 chance it will be partially operational tomor- 
row, and otherwise tomorrow it will be broken. If the machine is partially 
operational today, there is a .6 probability it will continue to be partially 
operational tomorrow and otherwise it will be broken (because the machine is 
never repaired in its partially operational state). Finally, if the machine is broken 
today, there is a .8 probability it will be repaired to fully operational status 
tomorrow; otherwise, it remains broken. Let $X_n$ = the state of the machine on
day n.

(a) Identify the state space of this chain. 

(b) Determine the complete set of one-step transition probabilities, and draw the 
corresponding state diagram. 

4. Michelle will flip a coin until she gets heads four times in a row. Define $X_0 = 0$
and, for $n \ge 1$, $X_n$ = the number of heads in the current streak of heads after the
nth flip.

(a) If the first seven flips result in the sequence HTHHHTH, determine the
values of $X_1, X_2, \ldots, X_7$. [Hint: Each time Michelle flips tails, the streak is
reset to 0.]

(b) Is this an example of a Markov chain? Explain why or why not. 

(c) Identify the state space of the chain. Treat reaching four heads in a row in the 
same manner that the $3 state was treated in the Gambler’s Ruin scenario of 
Example 6.2. 

(d) Assume P(H) = p for this particular coin. Determine the one-step transition 
probabilities of this chain, and draw the corresponding state diagram. 

5. A single cell has probability p of dividing into two cells and probability 1 — p of 
dying without dividing. Once two new cells have been created, each has the 
same probability p of splitting in two, independent of the other. In this fashion, 
cells continue to divide, either indefinitely or until all cells are dead (extinction 
of the cell line). Let $X_n$ = the number of cells in the nth generation, with $X_0 = 1$ to
reflect the initial, single cell.

(a) What are the possible numerical values of $X_1$, and what are their
probabilities?

(b) What are the possible numerical values of $X_2$?

(c) Determine the one-step transition probabilities for this chain. That is, given
there are x cells in the nth generation ($X_n = x$), determine the conditional
probability distribution of $X_{n+1}$.

[Note: This is an example of a branching process, commonly known as a
Galton-Watson process. See Exercise 163 at the end of Chap. 4 for information
on determining the probability of eventual extinction.]

6. Imagine a set of stacked files, such as papers on your desk. Occasionally, you 
will need to retrieve one of these files, which you will find by “sequential 
search”: looking at the first paper in the stack, then the second, and so on until 
you find the document you require. A sensible sequential search algorithm is to 
place the most recently retrieved file at the top of the stack, the idea being that 
files accessed more often will “rise to the top” and thus require less searching in 
the long run. For simplicity’s sake, imagine such a scenario with just three files, 
labeled A, B, C. 

(a) Let $X_n$ represent the sequence of the entire stack after the nth search. For
example, if the files are initially stacked A on top of B on top of C, then
$X_0$ = ABC. Determine the state space for this chain.

(b) If $X_0$ = ABC, list all possible states for $X_1$. [Hint: One of the three files will be
selected and rise to the front of the stack. Is every arrangement listed in
(a) possible, starting from ABC?]

(c) Suppose that, at any given time, there is probability $p_A$ that file A must
be retrieved, $p_B$ that file B must be retrieved, and similarly for
$p_C$ $(= 1 - p_A - p_B)$. Determine all of the non-zero one-step transition
probabilities.

7. Social scientists have used Markov chains to study “social mobility,” the 
movement of people between social classes, for more than a century. In a typical 
such model, states are defined as social classes, e.g., lower class, middle class, 
and upper class. The time index n refers to a familial generation, so if $X_n$
represents a man’s social class, then $X_{n-1}$ is his father’s social class, $X_{n-2}$ his
grandfather’s, and so on.

(a) In this context, what would it mean for $X_n$ to be a Markov chain? In
particular, would that imply that a grandfather’s social class has no bearing 
on his grandson’s? Explain. 

(b) What would it mean for this chain to be time-homogeneous? Does that seem 
realistic? Explain why or why not. 

8. The article “Markov Chain Models for Delinquency: Transition Matrix Estima- 
tion and Forecasting” (Appl. Stochastic Models Bus. Ind., 2011: 267-279) 
classifies loan status into four categories: current (payments are up-to-date), 
delinquent (payments are behind but still being made), loss (payments have 
stopped permanently), and paid (the loan has been paid off). Let $X_n$ = the status
of a particular loan in its nth month, and assume (as the authors do) that $X_n$ is a
Markov chain.

(a) Suppose that, for one particular loan type, $P(\text{delinquent} \to \text{current}) = .1$ and
$P(\text{current} \to \text{delinquent}) = .3$. Interpret these probabilities.

(b) According to the definitions of the “loss” and “paid” states, what are
$P(\text{loss} \to \text{loss})$ and $P(\text{paid} \to \text{paid})$? [Hint: Refer back to Example 6.2.]

(c) Draw the state diagram for this Markov chain. 

(d) What would it mean for this Markov chain to be time-homogeneous? Does 
that seem realistic? Explain. 

9. The article cited in Exercise 1 also suggests a more complex negotiation model, 
wherein the strategy employed at the nth stage (cooperative or competitive) is 
predicted not only by the immediately preceding action but also the one before 
it. So, negotiator A’s next strategy is determined not only by negotiator B’s most 
recent move, but also by A’s choice just before that. Again, let $X_n$ = the
negotiating strategy used at the nth stage.

(a) Is $X_n$ a Markov chain? Explain.

(b) How could you modify this example to create a Markov chain? What 
additional information would you need to completely specify this chain? 
[Hint: See Example 6.7.] 

10. Let $X_0, X_1, X_2, \ldots$ be a sequence of independent discrete rvs taking values in
some common state space.

(a) Show that $X_n$ satisfies the Markov property. (That is, all sequences of
independent rvs on a common state space are trivially discrete-space
Markov chains.)

(b) What additional condition(s), if any, must be satisfied for $X_n$ to be a
time-homogeneous Markov chain?


6.2 The Transition Matrix and the Chapman–Kolmogorov Equations


Section 6.1 introduced the notion of a Markov chain and its characteristic one-step 
transition probabilities. In this section, we will develop a systematic way to 
determine the probability that a chain moves from one state to another in two 
steps (or three or four ...) by considering all the intermediate paths the chain may 
have taken. Such calculations are facilitated by aggregating the transition 
probabilities into a matrix. 


6.2.1 The Transition Matrix


The one-step transition probabilities for the taxi driver example were displayed
in Example 6.4 as a 3 × 3 array. It would be more efficient to simply specify
the probabilities themselves in that same format, with the understanding that
the probability in the ith row and jth column indicates the transition probability
$P(i \to j)$, the chance the taxi driver takes his next fare to zone j given that he picks
up the fare in zone i. Such a representation will be critical to understanding how
various multistep transition probabilities are calculated.
DEFINITION
Let $X_0, X_1, X_2, \ldots$ be a finite-state, time-homogeneous Markov chain, and
index the states of the chain by the positive integers 1, 2, ..., s. The (one-step)
transition matrix of the Markov chain is the s × s matrix P whose (i, j)th
entry is given by

$$p_{ij} = P(i \to j) = P(X_{n+1} = j \mid X_n = i)$$

for i = 1, ..., s and j = 1, ..., s.


Example 6.8 (Example 6.4 continued) The one-step transition matrix for our taxi
driver example is

$$P = \begin{bmatrix} .3 & .2 & .5 \\ .1 & .8 & .1 \\ .4 & .4 & .2 \end{bmatrix}$$

which is identical in format to the display in Example 6.4. The entries are
interpreted as the preceding definition suggests, e.g., the upper left entry (first
row, first column) of the matrix is

$$p_{11} = P(1 \to 1) = P(X_{n+1} = 1 \mid X_n = 1) = .3,$$

i.e., the conditional probability that his next fare is dropped off somewhere in zone
1 given that the taxi is currently in zone 1. ■


Example 6.9 (Example 6.5 continued) For the Gambler’s Ruin scenario with a 
total available fortune of $3, rather than label the four possible states as 1, 2, 3, 4, 
it’s more natural to use state labels 0, 1, 2, and 3 corresponding to Allan’s fortune at 
any particular time. The transition probabilities specified previously may be written 
as the following 4 × 4 matrix:

$$
P = \begin{array}{c} 0 \\ 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
1-p & 0 & p & 0 \\
0 & 1-p & 0 & p \\
0 & 0 & 0 & 1
\end{bmatrix}
$$

The labels along the left-hand side of the matrix indicate the ordering of the
states for the purpose of creating this matrix; they are not, strictly speaking, a part of
P. For example, $P(X_{n+1} = 1 \mid X_n = 2)$ = P(Allan loses the next game) = 1 − p, while
$P(X_{n+1} = 3 \mid X_n = 0) = 0$. ■
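
As an illustrative sketch (not part of the original example), a Gambler's Ruin transition matrix of this form can be generated for any win probability p. The function name `gamblers_ruin_matrix` and its `total` parameter are hypothetical names chosen here; plain Python lists stand in for the matrix, with rows and columns indexed by the fortune 0, 1, 2, 3.

```python
def gamblers_ruin_matrix(p, total=3):
    """Return the one-step transition matrix for fortunes 0..total.

    Hypothetical helper: states 0 and total are absorbing; from any other
    fortune i the chain moves to i + 1 with probability p, else to i - 1.
    """
    size = total + 1
    P = [[0.0] * size for _ in range(size)]
    P[0][0] = 1.0            # broke: the chain stays at $0
    P[total][total] = 1.0    # holds all the money: the chain stays put
    for i in range(1, total):
        P[i][i + 1] = p          # wins the next $1 game
        P[i][i - 1] = 1.0 - p    # loses the next $1 game
    return P


P = gamblers_ruin_matrix(0.55)
for fortune, row in enumerate(P):
    print(fortune, row)
```

Indexing rows by the fortune itself keeps the code aligned with the state labels 0, 1, 2, 3 used in the example.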


Example 6.10 (Example 6.6 continued) The snow depth model has only two 
states, S (snowy day) and G (“green” day). The one-step transition probabilities 
given for New York City can be summarized by the following 2 × 2 matrix:

$$P = \begin{bmatrix} \cdots & \cdots \\ .224 & .776 \end{bmatrix}$$


Notice that the entries of every row in all of the preceding transition matrices
sum to 1. This will always be the case: given that the chain is currently in some state
i, it has to go somewhere in its next step (even if that entails remaining in state i).
That is, for any state i and any time index n, we must have

$$\sum_{j=1}^{s} p_{ij} = \sum_{j=1}^{s} P(i \to j) = \sum_{j=1}^{s} P(X_{n+1} = j \mid X_n = i) = 1$$
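
The row-sum property gives a quick sanity check on any proposed transition matrix. The following is a minimal sketch, assuming the matrix is stored as a list of rows; the function name `is_transition_matrix` and the tolerance `tol` are choices made here for illustration, not notation from the text.

```python
import math


def is_transition_matrix(P, tol=1e-9):
    """Check that P is square, has no negative entries, and rows sum to 1."""
    size = len(P)
    for row in P:
        if len(row) != size:
            return False                      # not square
        if any(entry < 0 for entry in row):
            return False                      # a probability cannot be negative
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            return False                      # the chain must go somewhere
    return True


# Taxi driver matrix from Example 6.8
taxi = [[.3, .2, .5],
        [.1, .8, .1],
        [.4, .4, .2]]
print(is_transition_matrix(taxi))

# A deliberately invalid matrix: the first row sums to .9
print(is_transition_matrix([[.5, .4], [.1, .9]]))
```

The tolerance is needed because decimal probabilities such as .1 are not stored exactly in floating-point arithmetic.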


6.2.2 Computation of Multistep Transition Probabilities 


We now turn to the determination of multistep transition probabilities. Given that a 
Markov chain is currently in state i, what is the probability it will be in state j two
steps later (i.e., after two transitions)? Three steps later? We begin with the
following definition. 


DEFINITION 
Let $X_0, X_1, X_2, \ldots$ be a time-homogeneous Markov chain. For any positive
integer k, the k-step transition probabilities are defined by

$$P^{(k)}(i \to j) = P(X_{n+k} = j \mid X_n = i) \qquad (6.2)$$

where i and j range across the states of the chain (typically 1, ..., s). For k = 1,
we will typically revert to the previous notation: $P^{(1)}(i \to j) = P(i \to j)$.


The superscript (k) in Expression (6.2) does not indicate taking the kth power; it 
is simply notation representing “in k steps.” The next example illustrates how these 
k-step transition probabilities can be calculated, and how they can be represented 
compactly in terms of powers of the one-step transition matrix. 


Example 6.11 (Example 6.8 continued) Suppose our taxi driver just dropped off
a fare in zone 3, so that zone 3 is his current state. What is the probability that his
second fare, counting from now, takes him to zone 1? That is, we wish to determine
$P(X_{n+2} = 1 \mid X_n = 3) = P^{(2)}(3 \to 1)$. The calculation method is suggested by Fig. 6.4.
Consider all the possible destinations of the (n + 1)st fare, i.e., all the intermediate
steps the taxi driver could take from zone 3 to zone 1, and then employ the Law of
Total Probability (applied here to conditional probabilities).

The partitioning events in the Law of Total Probability are the possible states at
time n + 1:
Fig. 6.4 Transitioning from state 3 to state 1 in two time steps (diagram columns: time n, time n+1, time n+2)


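$$
\begin{aligned}
P^{(2)}(3 \to 1) &= P(X_{n+2} = 1 \mid X_n = 3) \\
&= P(X_{n+1} = 1 \mid X_n = 3)\,P(X_{n+2} = 1 \mid X_n = 3, X_{n+1} = 1) \\
&\quad + P(X_{n+1} = 2 \mid X_n = 3)\,P(X_{n+2} = 1 \mid X_n = 3, X_{n+1} = 2) \\
&\quad + P(X_{n+1} = 3 \mid X_n = 3)\,P(X_{n+2} = 1 \mid X_n = 3, X_{n+1} = 3)
\end{aligned}
$$

By the Markov property, $P(X_{n+2} = 1 \mid X_n = 3, X_{n+1} = 1) = P(X_{n+2} = 1 \mid X_{n+1} = 1)$,
and the other two probabilities involving conditioning on $X_n$ and $X_{n+1}$ simplify
analogously. Thus,

$$
\begin{aligned}
P^{(2)}(3 \to 1) &= P(X_{n+1} = 1 \mid X_n = 3)\,P(X_{n+2} = 1 \mid X_{n+1} = 1) \\
&\quad + P(X_{n+1} = 2 \mid X_n = 3)\,P(X_{n+2} = 1 \mid X_{n+1} = 2) \\
&\quad + P(X_{n+1} = 3 \mid X_n = 3)\,P(X_{n+2} = 1 \mid X_{n+1} = 3) \\
&= P(3 \to 1)P(1 \to 1) + P(3 \to 2)P(2 \to 1) + P(3 \to 3)P(3 \to 1) \\
&= (.4)(.3) + (.4)(.1) + (.2)(.4) = .24
\end{aligned}
$$

For later reference, the last expression could be written in terms of the elements
of the transition matrix P; specifically, it’s $p_{31}p_{11} + p_{32}p_{21} + p_{33}p_{31}$.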

Similarly, the conditional probability that his second fare wants to be dropped
off in zone 2 is computed by

$$
\begin{aligned}
P^{(2)}(3 \to 2) &= P(3 \to 1)P(1 \to 2) + P(3 \to 2)P(2 \to 2) + P(3 \to 3)P(3 \to 2) \\
&= \sum_{j=1}^{3} P(3 \to j)P(j \to 2) = \sum_{j=1}^{3} p_{3j}\,p_{j2} \\
&= (.4)(.2) + (.4)(.8) + (.2)(.4) = .48
\end{aligned}
$$

Finally, the probability the taxi driver finds himself back in zone 3 after two
fares is

$$P^{(2)}(3 \to 3) = \sum_{j=1}^{3} p_{3j}\,p_{j3} = (.4)(.5) + (.4)(.1) + (.2)(.2) = .28$$

This makes sense, since the taxi driver must arrive in one of the three zones at
time n + 2, and 1 − (.24 + .48) = .28. ■
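
The Law of Total Probability calculation in Example 6.11 translates directly into a short sum over the intermediate zone. The sketch below is one possible rendering, assuming the taxi matrix is stored as a nested dictionary keyed by zone number; the function name `two_step` is a label chosen here.

```python
# Taxi driver one-step probabilities: P[i][j] = P(i -> j), zones 1, 2, 3
P = {
    1: {1: .3, 2: .2, 3: .5},
    2: {1: .1, 2: .8, 3: .1},
    3: {1: .4, 2: .4, 3: .2},
}


def two_step(P, i, j):
    """P^(2)(i -> j): sum over every intermediate state k of P(i->k) * P(k->j)."""
    return sum(P[i][k] * P[k][j] for k in P)


for destination in (1, 2, 3):
    print(destination, round(two_step(P, 3, destination), 4))
```

Each term of the sum corresponds to one branch of Fig. 6.4: go from zone 3 to zone k on the first fare, then from zone k to the destination on the second.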


The sums of products of matrix entries that appear repeatedly in the preceding 
example should look familiar: they are the same manner of computation that arises 
when one matrix is multiplied by another (or, here, a matrix is multiplied by itself). 
Indeed, consider what happens if we multiply the one-step transition matrix P from 
Example 6.8 by itself: 


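$$
P^2 = P \cdot P =
\begin{bmatrix} .3 & .2 & .5 \\ .1 & .8 & .1 \\ .4 & .4 & .2 \end{bmatrix}
\begin{bmatrix} .3 & .2 & .5 \\ .1 & .8 & .1 \\ .4 & .4 & .2 \end{bmatrix}
=
\begin{bmatrix} .31 & .42 & .27 \\ .15 & .70 & .15 \\ .24 & .48 & .28 \end{bmatrix}
$$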


The entries in the bottom row (.24, .48, .28) are precisely the two-step transition
probabilities computed in Example 6.11. Specifically, the (3, 1) entry of $P^2$ is
$P^{(2)}(3 \to 1) = .24$, the (3, 2) entry of $P^2$ is $P^{(2)}(3 \to 2) = .48$, and the (3, 3) entry
of $P^2$ is $P^{(2)}(3 \to 3) = .28$. It should come as no surprise that the other six entries of
$P^2$ follow the same pattern: the (i, j) entry of $P^2$ is equal to $P^{(2)}(i \to j)$. Hence, we
can obtain all nine two-step transition probabilities with a single matrix computation
(which, of course, can be facilitated by Matlab or other matrix-capable
software).
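
For readers without Matlab at hand, the same product can be formed in a few lines of plain Python. This is only a sketch: the helper name `mat_mul` is our own, and the matrix is stored as a list of rows, so zone 3 corresponds to index 2.

```python
def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


# Taxi driver matrix from Example 6.8
P = [[.3, .2, .5],
     [.1, .8, .1],
     [.4, .4, .2]]

P2 = mat_mul(P, P)
for row in P2:
    print([round(entry, 2) for entry in row])
```

The bottom row of `P2` reproduces the three two-step probabilities found by hand in Example 6.11.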

The foregoing result can be generalized to an arbitrary fixed number of steps: to
find the three-step transition probabilities, for example, one only needs to compute
the matrix $P^3$. It is not necessary to consider explicitly the many different paths by
which the Markov chain could transition from state i to state j in three steps and add
up all the corresponding probabilities (this is, secretly, what the threefold matrix
multiplication does). The most general result is often referred to as the set of
Chapman–Kolmogorov Equations.


CHAPMAN-KOLMOGOROV EQUATIONS
If a Markov chain has one-step transition matrix P, then the k-step transition
probabilities are the entries of the matrix $P^k$. Specifically,

$$P^{(k)}(i \to j) = \text{the } (i, j) \text{ entry of } P^k$$
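
To see that the matrix power really does account for every intermediate path, the following sketch compares two computations of the three-step probabilities for the taxi matrix of Example 6.8: brute-force enumeration of all paths, and repeated matrix multiplication. The function names are hypothetical, and states are indexed 0, 1, 2 (zone minus 1) to match Python's list indexing.

```python
from itertools import product


def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def mat_pow(P, k):
    """Compute P^k for k >= 1 by multiplying P by itself k - 1 times."""
    result = [row[:] for row in P]
    for _ in range(k - 1):
        result = mat_mul(result, P)
    return result


def k_step_by_paths(P, i, j, k):
    """Add up the probabilities of every k-step path from state i to state j."""
    total = 0.0
    for middle in product(range(len(P)), repeat=k - 1):
        path = (i,) + middle + (j,)
        prob = 1.0
        for a, b in zip(path, path[1:]):
            prob *= P[a][b]       # Markov property: multiply one-step probabilities
        total += prob
    return total


P = [[.3, .2, .5],
     [.1, .8, .1],
     [.4, .4, .2]]

P3 = mat_pow(P, 3)
largest_gap = max(abs(P3[i][j] - k_step_by_paths(P, i, j, 3))
                  for i in range(3) for j in range(3))
print(largest_gap)   # differs from 0 only by floating-point rounding
```

Path enumeration visits $3^{k-1}$ paths per entry, so it quickly becomes impractical as k grows, while the matrix power needs only k − 1 multiplications.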


Example 6.12 (Example 6.11 continued) Back to our intrepid taxi driver: if he
just dropped off a fare in zone 2, what is the probability that he will be in zone
1 two fares later? That is, we wish to determine the two-step transition probability
$P(X_{n+2} = 1 \mid X_n = 2) = P^{(2)}(2 \to 1)$. According to the Chapman–Kolmogorov Equations,
this is simply the (2, 1) entry of the foregoing matrix $P^2$:

$$P^{(2)}(2 \to 1) = .15$$
Now consider a longer-term question: If the taxi driver starts the day in zone 3 and
transports ten fares before lunch, what is the probability he ends up “back home”
(i.e., in zone 3) for lunch? The goal is to find $P(X_{10} = 3 \mid X_0 = 3) = P^{(10)}(3 \to 3)$, which
could involve summing up a terrifying number of intermediate travel options (19,683
of them, to be precise!). But the Chapman–Kolmogorov Equations, coupled with
computer software, make light work of the problem. With the aid of Matlab, the
tenth power of P is found to be

$$P^{10} = \begin{bmatrix} .2004 & .5993 & .2004 \\ .1998 & .6004 & .1998 \\ .2002 & .5996 & .2002 \end{bmatrix}$$

The desired probability is just the (3, 3) entry of this 10-step transition matrix:
$P^{(10)}(3 \to 3) = .2002$. ■
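
The tenth power quoted above was obtained with Matlab; as a hedged alternative, the sketch below reproduces the (3, 3) entry in plain Python. It uses repeated squaring, a standard way to reach high powers with few multiplications; the helper names are ours, and zone 3 is index 2 under zero-based indexing.

```python
def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def mat_pow(P, k):
    """Raise a square matrix to the power k >= 0 by repeated squaring."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]          # start from the identity matrix
    base = [row[:] for row in P]
    while k > 0:
        if k % 2 == 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k //= 2
    return result


P = [[.3, .2, .5],
     [.1, .8, .1],
     [.4, .4, .2]]

n_paths = 3 ** 9          # three choices for each of the nine intermediate zones
P10 = mat_pow(P, 10)
back_home = P10[2][2]     # the (3, 3) entry
print(n_paths, round(back_home, 4))
```

The count `n_paths` is where the figure 19,683 comes from: ten fares leave nine intermediate drop-off zones free to vary.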


Example 6.13 The report “Research and Application by Markov Chain Operators 
in the Mobile Phone Market” (Second International Conference on Artificial 
Intelligence, Management Science and Electronic Commerce (AIMSEC), 2011) 
details an analysis of customer loyalty and movement between China’s three major 
cell phone service providers: (1) China Telecom, (2) China Unicom, and (3) China 
Mobile. A “transition” in this setting refers to an opportunity for a customer to 
renew his or her contract with a current provider or else switch to one of the other 
two companies. The report includes the following one-step transition matrix, with 
the companies numbered as above: 


$$P = \begin{bmatrix} .84 & .06 & .10 \\ .08 & .82 & .10 \\ .10 & .04 & .86 \end{bmatrix}$$


The entries along the main diagonal indicate customer loyalty, e.g., 84% of 
China Telecom customers stick with that company when their contract expires. 

Suppose a customer is currently with China Unicom. What is the probability she 
will be with the same service provider three contracts from now? In other words, 
what is $P^{(3)}(2 \to 2)$? According to the Chapman–Kolmogorov Equations, we need
the (2, 2) entry of $P^3$. That matrix is computed to be

$$P^3 = \begin{bmatrix} .6310 & .1352 & .2338 \\ .1920 & .5742 & .2338 \\ .2267 & .1006 & .6727 \end{bmatrix}$$

from which we may extract $P^{(3)}(2 \to 2) = .5742$.

It’s important to distinguish this probability from the answer to a more 
restrictive question: what’s the chance she stays with China Unicom for all of 
her next three cell phone contracts? This probability can be represented as 
$P(X_{n+1} = 2 \cap X_{n+2} = 2 \cap X_{n+3} = 2 \mid X_n = 2)$ or, less formally, as $P(2 \to 2 \to 2 \to 2)$.
Applying the Markov property gives $[P(2 \to 2)]^3 = p_{22}^3 = (.82)^3 = .5514$. This proba-
bility is slightly lower than $P^{(3)}(2 \to 2) = .5742$, since the latter accounts for the
possibility that the customer switches companies at some intermediate stage(s) but
ends up back with China Unicom three contracts later. ■
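
The distinction between "back with China Unicom after three contracts" and "with China Unicom for all three contracts" can be checked numerically. The sketch below is illustrative only; the variable names are ours, and provider 2 (China Unicom) sits at index 1 because Python lists are zero-based.

```python
def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


# One-step matrix from Example 6.13: (1) Telecom, (2) Unicom, (3) Mobile
P = [[.84, .06, .10],
     [.08, .82, .10],
     [.10, .04, .86]]

P3 = mat_mul(mat_mul(P, P), P)

back_after_three = P3[1][1]       # P^(3)(2 -> 2): any route that ends at Unicom
stay_all_three = P[1][1] ** 3     # P(2 -> 2 -> 2 -> 2): never leaves Unicom

print(round(back_after_three, 4), round(stay_all_three, 4))
```

The difference between the two numbers is the total probability of the routes that leave China Unicom at some point and return by the third contract.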


Example 6.14 (Example 6.9 continued) Suppose in our earlier Gambler’s Ruin 
example that p = .55; that is, Allan has a 55% chance of winning any particular $1 
game. The one- and two-step transition matrices are as follows: 


$$
P = \begin{array}{c} 0 \\ 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
.45 & 0 & .55 & 0 \\
0 & .45 & 0 & .55 \\
0 & 0 & 0 & 1
\end{bmatrix}
\qquad
P^2 = \begin{array}{c} 0 \\ 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
.45 & .2475 & 0 & .3025 \\
.2025 & 0 & .2475 & .55 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$

As before, Allan starts with $2. Looking at the $2 row (i.e., the third row) of $P^2$,
there is a .2025 probability he has gone broke after two games. This is easy to
compute by hand: since he could only lose $2 in two games by losing twice,
the chance is $(.45)^2 = .2025$. The chance that he is back to where he started after
two games (i.e., $X_2 = \$2$) is the ($2, $2) entry of $P^2$: .2475. This also could
have occurred in just one way: $\$2 \to \$1 \to \$2$, for which the two-step transition
probability is (.45)(.55) = .2475. Notice that the ($2, $1) entry of $P^2$ is 0, i.e.,
$P^{(2)}(\$2 \to \$1) = 0$. Since exactly $1 changes hands at the end of each game, it’s
impossible for Allan to transition from $2 to $1 in exactly two steps. Finally, observe
that the ($2, $3) entry of both matrices is .55, so $P(\$2 \to \$3) = P^{(2)}(\$2 \to \$3) = .55$.
That’s because the game ends when Allan has all $3 at stake, which he could achieve
in one step with probability p = .55. Having done so, he will “stay at $3” in the
imaginary second game/step, i.e., from a mathematical perspective, the observed
sequence of the Markov chain steps $X_0$, $X_1$, and $X_2$ is $\$2 \to \$3 \to \$3$, with the second
transition occurring with probability 1.

A natural concern from Allan’s perspective is the likelihood that he will eventu- 
ally win. One way to estimate that probability is to look at the chance Allan has 
arrived at the $3 “state” after some large number of steps. (This works because once 
he has $3, he will always remain at $3.) Matlab can easily calculate high powers of
small matrices; we requested $P^{75}$:

$$
P^{75} =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
.5980 & 0 & .0000 & .4020 \\
.2691 & .0000 & 0 & .7309 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$
The two entries that read .0000 indicate that the probability is not strictly 0, but
rather is 0 to four decimal places. From this matrix, we have that

$$
\begin{aligned}
P(\text{Allan eventually has } \$3 \mid X_0 = \$2) &\approx P(\text{Allan has } \$3 \text{ after 75 steps} \mid X_0 = \$2) \\
&= P^{(75)}(\$2 \to \$3) = .7309
\end{aligned}
$$

Had Allan started with just $1, he would have a roughly .4020 chance of
eventually winning all the money.

In Sect. 6.5, we will present an exact method for determining the probability that
Allan eventually wins (or loses) his competition with Beth. ■
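
The high-power approximation in Example 6.14 does not require Matlab. As a rough sketch in plain Python (the helper `mat_mul` and the variable names are ours), one can multiply by P seventy-five times and read off the last column, which holds the probabilities of sitting at $3 after 75 steps.

```python
def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


p = 0.55
# Gambler's Ruin matrix from Example 6.14; rows and columns are fortunes 0..3
P = [[1.0,   0.0,   0.0, 0.0],
     [1 - p, 0.0,   p,   0.0],
     [0.0,   1 - p, 0.0, p],
     [0.0,   0.0,   0.0, 1.0]]

P75 = [row[:] for row in P]
for _ in range(74):               # 74 further multiplications give P^75
    P75 = mat_mul(P75, P)

win_from_2 = P75[2][3]            # P^(75)($2 -> $3)
win_from_1 = P75[1][3]            # P^(75)($1 -> $3)
print(round(win_from_2, 4), round(win_from_1, 4))
```

Because $0 and $3 are absorbing, the last column can only grow with each additional step, which is why a large power serves as an approximation to the eventual win probability.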


6.2.3 Exercises: Section 6.2 (11-22) 


11. The authors of the article “The Fate of Priority Areas for Conservation in 
Protected Areas: A Fine-Scale Markov Chain Approach” (Envir. Mgmnt., 
2011: 263-278) postulated the following model for landscape changes in the 
forest regions of Italy. Each “pixel” on a map is classified as forested (F) or
non-forested (NF). For any specific pixel, $X_n$ represents its status n years after
2000 (so $X_1$ corresponds to 2001, $X_2$ to 2002, and so on). Their analysis showed
that a pixel has a 90% chance of being forested next year if it is forested this
year and an 11% chance of being forested next year if it is non-forested this year;
moreover, data in the twenty-first century are consistent with the assumptions
of a Markov chain.

(a) Construct the one-step transition matrix for this chain, with states 1 = F and
2 = NF.

(b) If a map pixel was forested in the year 2000, what is the probability it was 
still forested in 2002? 2013? 

(c) If a map pixel was non-forested in the year 2000, what is the probability it 
was still non-forested in 2002? 2013? 

(d) The article’s authors use this model to project forested status for several 
Italian regions in the years 2050 and 2100. Comment on the assumptions 
required for these projections to be valid. 

12. A large automobile insurance company classifies its customers into four risk 
categories (1 being the lowest risk, aka best/safest drivers, 4 being the worst/ 
highest risk; premiums are assigned accordingly). Each year, upon renewal of a 
customer’s insurance policy, the risk category may change depending on the 
number of accidents in the previous year. Actuarial data suggest the following: 
category 1 customers stay in category 1 with probability .9 and move to
categories 2, 3, 4 with probabilities .07, .02, and .01, respectively. Category
2 customers shift to category 1 (based on having no accidents last year) with
probability .8 and rise to risk categories 3 and 4 with probabilities .15 and .05,
respectively. Similarly, category 3 customers transition to 2 and 4 with
probabilities .7 and .3, while category 4 customers stay in that risk category
with probability .4 and move to category 3 otherwise.

(a) Let $X_n$ denote a customer’s risk category for his/her nth year with the
insurance company. Construct the one-step transition matrix for this Mar-
kov chain.

(b) If a customer starts in category 1, what is the probability she falls into risk 
category 2 five years later? 

(c) If a customer is currently in risk category 4, determine the probability he
will be a category 1 driver in k years, for k = 1, 2, 3, 4, 5, 6.

(d) What is the probability that a driver currently in category 1 remains in that 
category for each of the next 5 years? 

13. The article cited in Example 6.6 also gives the following one-step transition
matrix, with the same definitions of states, for Willow City, ND:

$$
P = \begin{array}{c|cc}
 & G & S \\ \hline
G & .933 & .067 \\
S & .012 & .988
\end{array}
$$

(a) Contrast Willow City with New York City: where is snow more likely to 
stay on the ground for an extended time period? Explain. 

(b) If today is a snowy day in Willow City, what is the probability it will also 
be a snowy day there two days from now? Three days from now?

(c) If today is a snowy day in Willow City, what is the probability it will 
continue to be snowy for the next 4 days in a row? 

14. I (author Carlton) have a six-room house whose configuration is depicted in the
accompanying diagram. When my sister and her family visit, I often play hide-
and-seek with my young nephew, Lucas. Consider the following situation:
Lucas counts to ten in Room 1, while I run and hide in Room 6. Lucas’
“strategy,” as much as he has one, is such that standing in any room of the
house, he is equally likely to next visit any of the adjacent rooms, regardless of
where he has searched previously. (The exception, of course, is if he enters
Room 6, in which case he discovers me and the round of hide-and-seek is over.)

(a) Let $X_n$ = the nth room Lucas visits (with $X_0 = 1$, his starting point). Con-
struct the one-step transition matrix for the corresponding Markov chain.

(b) What is the probability that his third room-to-room transition will take him 
into Room 2? 

(c) What is the fewest number of time steps (i.e., room transitions) required for 
Lucas to find me? 

(d) What is the probability that, after 12 time steps, he still hasn’t found me? 

15. Refer back to Exercise 1 in the previous section. Consider two negotiators, A
and B, who employ strategies according to the Markov chain model described.

(a) Construct the one-step transition matrix for the Markov chain $X_n$ = strategy
employed at the nth stage of a negotiation, assuming the states are (1) coop-
erative and (2) competitive.

(b) If negotiator A employs a cooperative strategy at some stage, what is the
probability she uses a competitive strategy the next time? [Don’t forget that
A’s turns are two time steps apart, since B counter-negotiates in between.]

(c) Now introduce a third state, (3) end of the negotiation. Assume that a
Markov chain model with the following one-step transition matrix applies:

$$P = \begin{bmatrix} .6 & .2 & .2 \\ .3 & .4 & .3 \\ 0 & 0 & 1 \end{bmatrix}$$


Given that the initial strategy presented is cooperative, what is the proba- 
bility the negotiations end within three time steps? 

(d) Refer back to (c). Given that the initial strategy presented is competitive, 
what is the probability the negotiations end within three time steps? 

16. Sarah, a statistician at a large Midwestern polling agency, owns four umbrellas.
Initially, two of them are at her home and two are at her office. Each morning,
she takes an umbrella with her to work (assuming she has any at home) if and
only if it’s currently raining, which happens on 20% of mornings. Each
evening, she takes an umbrella from work to home (again, assuming any are
available) if and only if it’s raining when she leaves work, which happens on
30% of all evenings. Assume weather conditions, including morning and
evening on the same day, are independent (in the Midwest, that’s not unrealis-
tic). Let $X_n$ = the number of umbrellas Sarah has at home at the end of her nth
work day (i.e., once she’s back at home).

(a) Identify the state space for this chain. 

(b) Assume Sarah has two umbrellas at home tonight. By considering all 
possible weather conditions tomorrow morning and tomorrow evening, 
determine the one-step transition probabilities for the number of umbrellas 
she’ll have at home tomorrow night. 

(c) Repeat the logic of (b) to determine the complete one-step transition matrix 
for this chain. Be careful when considering the two extreme cases! 

17. Refer back to the previous exercise.

(a) Given that Sarah has two umbrellas at home (and two at work) as of Sunday
night, what is the probability she’ll have exactly two umbrellas at home the 
following Friday night? What is the probability she’ll have at least two 
umbrellas at home the following Friday night? 

(b) Given that Sarah has two umbrellas at home Sunday night, what are the 
chances she won’t have an umbrella to take with her to work the following 
Thursday morning when a surprise thunderstorm moves through the area? 

(c) Assume again that Sarah has two umbrellas at home at the start of the week. 
Determine the expected number of umbrellas she has at home at the end of 
Monday and at the end of Tuesday. [Hint: $X_n$ is a discrete rv; if $X_0 = 2$, then
the probability distribution of $X_n$ appears in the corresponding row of $P^n$.]

18. A box always contains three marbles, each of which is green or yellow. At
regular intervals, one marble is selected at random from the box and removed,
while another is put in its place according to the following rules: a green marble
is replaced by a yellow marble with probability .3 (and otherwise by another
green marble), while a yellow marble is equally likely to be replaced by either
color. Let $X_n$ = the number of green marbles in the box after the nth swap.

(a) What are the possible values of $X_n$?

(b) Construct the one-step transition matrix for this Markov chain. 

(c) If all three marbles currently in the box are green, what is the probability 
the same will be true three swaps from now? 

(d) If all three marbles currently in the box are green, what is the probability 
that the fourth marble selected from the box will be green? [Hint: Use part 
(c). Be careful not to confuse the color of the marble selected on the fourth 
swap with the color of the one that replaces it!] 

19. A Markov chain model for customer visits to an auto repair shop is described in
the article “Customer Lifetime Value Prediction by a Markov Chain Based
Data Mining Model: Application to an Auto Repair and Maintenance Company
in Taiwan” (Scientia Iranica, 2012: 849-855). Customers make between 0 and
4 visits to the repair shop each year; for any customer that made exactly i visits
last year, the number of visits s/he will make next year follows a Poisson
distribution with parameter $\mu_i$. (The event “4 visits” is really “4 or more visits,”
so the probability of 4 visits next year is calculated as $1 - \sum_{x=0}^{3} p(x; \mu_i)$ from
the appropriate Poisson pmf.) Parameter values cited in the article, which were
estimated from real data, appear in the accompanying table.

i        0          1          2          3          4
$\mu_i$  1.938696   1.513721   1.909567   2.437809   3.445738

(a) Construct the one-step transition matrix for the chain $X_n$ = number of repair
shop visits by a randomly selected customer in the nth observed year.

(b) If a customer made two visits last year, what is the probability that s/he
makes two visits next year and two visits the year after that? 

(c) If a customer made no visits last year, what is the probability s/he makes a 
total of exactly two visits in the next 2 years? 

20. The four vans in a university’s vanpool are maintained at night by a single
mechanic, who can service one van per night (assuming any of them need repairs).
Suppose that there is a 10% chance that a van working today will need service by
tonight, independent of the status of the other vans. We wish to model $X_n$ = the
number of vans unavailable for service at the beginning of the nth day.

(a) Suppose all four vans were operational as of this morning. Find the
probability that exactly j of them will be unusable tomorrow morning for
j = 0, 1, 2, 3. [Hint: The number of unusable vans for tomorrow will be
1 less than the number that break down today, unless that’s 0, because the
mechanic can fix only one van per night. What is the probability distribu-
tion of Y = the number of vans that break down today, assuming all
4 worked this morning?]

(b) Suppose three vans were operational as of this morning, and one was
broken. Find the probabilities $P(1 \to j)$ for this chain. [Hint: Assume the
broken van will be fixed tonight. Then the number of unavailable vans
tomorrow morning is the number that break down today, out of the three
currently functioning.]

(c) Use reasoning similar to that of (a) and (b) to determine the complete
one-step transition matrix for this Markov chain.

21. Refer back to the previous exercise.

(a) If all four vans were operational as of Monday morning, what is the
probability exactly three vans will be usable Wednesday morning?
Thursday morning? Friday morning?

(b) A backlog occurs whenever $X_n \ge 1$, indicating that some vans will be
temporarily out of commission because the mechanic could not get to
them the previous night. Assuming there was no backlog as of Monday
morning, what is the probability a backlog exists Tuesday morning?
Answer the same question for Wednesday, Thursday, and Friday mornings.

(c) How do the probabilities in (b) change if there was a backlog of 1 van as of
Monday morning?

22. Consider a Markov chain with state space {1, 2, ..., s}. Show that, for any
positive integers m and n and any states i and j,

\[
P^{(m+n)}(i \to j) = \sum_{k=1}^{s} P^{(m)}(i \to k)\, P^{(n)}(k \to j)
\]


This is an alternative version of the Chapman-Kolmogorov Equations. [Hint:
Write the left-hand side as \(P(X_{m+n} = j \mid X_0 = i)\), and consider all the possible
states after m transitions.]

6.3 Specifying an Initial Distribution


Thus far, every probability we have considered in this chapter (i.e., all the one-,
two-, and higher-step transition probabilities) has been conditional. For example, the
entries of any one-step transition matrix indicate \(P(X_{n+1} = j \mid X_n = i)\). In this section,
we briefly explore unconditional probabilities, which result from specifying a
distribution for the rv \(X_0\), the initial state of the chain. We will consider two
cases: modeling the initial state \(X_0\) as a random variable, and treating \(X_0\) as having
a fixed/known value.


Example 6.15 (Example 6.11 continued) The never-ending saga of the taxi driver 
continues! Imagine this poor fellow sleeps in his taxi, so from his perspective each 
new day starts in a “random” zone. Specifically, suppose for now that he has a 20% 
chance of waking up in zone 1, a 50% chance of waking up in zone 2, and a 30% 
chance of waking up in zone 3. That is, we have assigned the following initial 
distribution to the Markov chain: 


i               1     2     3          (6.3)
P(X_0 = i)     .2    .5    .3


Notice that, unlike the conditional probabilities that comprise the transition
matrix of the Markov chain, this initial distribution (6.3) specifies the unconditional
(aka marginal) distribution for the rv \(X_0\). In what follows, we will sometimes refer
to the bottom row of (6.3) as the “initial probability vector” of \(X_0\).

Now consider the rv \(X_1\), the destination of the taxi driver’s first fare. The
probability his first fare wants to go somewhere in zone 3 can be determined via
the Law of Total Probability:

\[
\begin{aligned}
P(X_1 = 3) &= P(X_0 = 1)P(X_1 = 3 \mid X_0 = 1) + P(X_0 = 2)P(X_1 = 3 \mid X_0 = 2) \\
&\quad + P(X_0 = 3)P(X_1 = 3 \mid X_0 = 3) \\
&= \sum_{i=1}^{3} P(X_0 = i)\,P(i \to 3) = \sum_{i=1}^{3} P(X_0 = i)\,p_{i3} \\
&= (.2)(.5) + (.5)(.1) + (.3)(.2) = .21
\end{aligned}
\]


As indicated in the intermediate step, this unconditional probability can be
computed by taking the product of the initial probability vector [.2 .5 .3], regarded
as a 1 × 3 matrix, with the third column of the one-step transition matrix P.
Similarly, the (unconditional) probability that his first fare wants to be dropped
off in zone 2 is

\[
P(X_1 = 2) = \sum_{i=1}^{3} P(X_0 = i)\,p_{i2} = (.2)(.2) + (.5)(.8) + (.3)(.4) = .56
\]

The foregoing computation is the product of the initial probability vector with
the second column of P. Finally, the probability that the first fare is taken to zone
1 equals .23, which can be computed either as a similar product or by observing that
1 − (.21 + .56) = .23. All together, the unconditional pmf of the rv \(X_1\) is


i               1      2      3
P(X_1 = i)     .23    .56    .21


Clearly, the most efficient way to determine the distribution of \(X_1\) is to compute
all three products simultaneously through matrix multiplication. If we multiply the
transition matrix P on the left by a 1 × 3 row vector containing the initial
probabilities for \(X_0\), we obtain

\[
[.2 \;\; .5 \;\; .3]
\begin{bmatrix} .3 & .2 & .5 \\ .1 & .8 & .1 \\ .4 & .4 & .2 \end{bmatrix}
= [.23 \;\; .56 \;\; .21]
\]

The method illustrated in the preceding example can be generalized to find the
unconditional distribution of the state \(X_n\) in the chain after any number of transitions
n, starting with a specified initial distribution for \(X_0\).


THEOREM 

Let \(X_0, X_1, \ldots, X_n, \ldots\) be a Markov chain with state space {1, ..., s} and
one-step transition matrix P. Let \(v_0 = [v_{01} \cdots v_{0s}]\) be a 1 × s vector specifying
the initial distribution of the chain, i.e., \(v_{0k} = P(X_0 = k)\) for k = 1, ..., s. If \(v_1\)
denotes the vector of marginal (i.e., unconditional) probabilities associated
with \(X_1\), then

\[
v_1 = v_0 P
\]

More generally, if \(v_n\) denotes the 1 × s vector of marginal probabilities for
\(X_n\), then

\[
v_n = v_0 P^n
\]


Proof The formula \(v_1 = v_0P\) can be established using the same computational
approach displayed in Example 6.15. Now consider \(v_2\), the vector of unconditional
probabilities for \(X_2\). By the same reasoning as in Example 6.15, we have

\[
v_2 = v_1 P
\]


The substitution \(v_1 = v_0P\) then yields \(v_2 = (v_0P)P = v_0P^2\). Continuing by
induction, we have for general n that \(v_n = v_{n-1}P = (v_0P^{n-1})P = v_0P^n\), as claimed. ■

With the aid of software such as Matlab, the unconditional distributions of future
states of the Markov chain can be computed very quickly once the initial distribution
is specified. For example, as a continuation of Example 6.15, the probability
vector for \(X_2\), the destination of the driver’s second fare, is given by

\[
v_2 = v_1P = [.23 \;\; .56 \;\; .21]
\begin{bmatrix} .3 & .2 & .5 \\ .1 & .8 & .1 \\ .4 & .4 & .2 \end{bmatrix}
= [.209 \;\; .578 \;\; .213]
\]

or, equivalently,

\[
v_2 = v_0P^2 = [.2 \;\; .5 \;\; .3]
\begin{bmatrix} .3 & .2 & .5 \\ .1 & .8 & .1 \\ .4 & .4 & .2 \end{bmatrix}^{2}
= [.209 \;\; .578 \;\; .213]
\]

That is, assuming that the initial distribution specified in Example 6.15 is
correct, the taxi driver has a 20.9% chance of taking his second fare to zone 1, a
57.8% chance of taking him/her to zone 2, and a 21.3% chance of being in zone
3 after two fares.
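
The theorem translates directly into a few lines of code. The following is a minimal Python sketch (standard library only) that propagates the initial distribution of Example 6.15 through the taxi chain. The helper names `step` and `distribution_after` are illustrative choices made here, not notation from the text.

```python
# Taxi chain of Example 6.15; rows and columns are zones 1, 2, 3.
P = [[0.3, 0.2, 0.5],
     [0.1, 0.8, 0.1],
     [0.4, 0.4, 0.2]]
v0 = [0.2, 0.5, 0.3]  # initial distribution of X_0


def step(v, P):
    """Return the row vector v P (one transition of the chain)."""
    s = len(P)
    return [sum(v[i] * P[i][j] for i in range(s)) for j in range(s)]


def distribution_after(v0, P, n):
    """Return v_n = v_0 P^n by applying `step` n times."""
    v = list(v0)
    for _ in range(n):
        v = step(v, P)
    return v


v1 = distribution_after(v0, P, 1)
v2 = distribution_after(v0, P, 2)
print([round(x, 3) for x in v1])  # [0.23, 0.56, 0.21]
print([round(x, 3) for x in v2])  # [0.209, 0.578, 0.213]
```

Applying `step` repeatedly avoids forming the matrix power explicitly, which is convenient when only one initial distribution is of interest.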


Example 6.16 As you probably learned in high school biology, Austro-Hungarian 
scientist Gregor Mendel studied the inheritance of characteristics within plant 
species, particularly peas. Suppose one particular pea plant can either be green or 
yellow, which is determined by a single gene with green (G) dominant over yellow 
(g). That is, the genetic material determining a plant’s color (its “genotype”) can be
one of three pairings (GG, Gg, or gg), depending on which types were passed on
by the parent plants. To say that green is “dominant” over yellow means that the
plant’s visible color (its “phenotype”) will be green unless that gene is
completely absent from the plant (so plants with GG or Gg genotype appear
green, while only gg plants are yellow).

Consider cross-breeding with a yellow plant, whose genotype is therefore known 
to be gg. Mendel’s laws of genetic recombination can be expressed by the following 
transition matrix, where X,, is the genotype of an nth-generation plant resulting from 
cross-breeding with a gg plant: 


\[
P = \begin{array}{c|ccc}
 & GG & Gg & gg \\ \hline
GG & 0 & 1 & 0 \\
Gg & 0 & .5 & .5 \\
gg & 0 & 0 & 1
\end{array}
\]


For example, crossing GG x gg yields Gg with probability 1, while Gg x gg 
results in Gg or gg with probability .5 each. 

Suppose our initial population of plants (to be cross-bred with the pure
yellow specimens) has the following genotype distribution: 70% GG, 20% Gg,
and 10% gg. The initial probability vector associated with this “0th generation” is
\(v_0 = [.7 \;\; .2 \;\; .1]\). The probabilities associated with the first generation of cross-bred
plants are \(v_1 = v_0P = [0 \;\; .8 \;\; .2]\), meaning that 80% of first-generation plants are
expected to be Gg and the remaining 20% gg. Notice that GG plants cannot exist past
the first generation, since cross-breeding with gg plants makes such a recombination
impossible.

Similarly, the second-generation probabilities are given by
\(v_2 = v_1P = v_0P^2 = [0 \;\; .4 \;\; .6]\), so that within two generations gg plants should be the
majority (60% gg compared to 40% Gg). As cross-breeding with pure gg plants
continues, that genotype will increase in relative proportion (80% in generation
3, 90% in generation 4), until eventually the dominant G allele dies out. ■
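
The percentages quoted for generations 3 and 4 follow a simple pattern that can be written in closed form. Since GG plants vanish after one round of breeding and each Gg plant produces Gg offspring with probability .5, the Gg share halves with every generation from the first onward. A short worked version, using only the numbers of this example:

```latex
\[
v_n = \bigl[\, 0 \;\;\; .8(.5)^{\,n-1} \;\;\; 1 - .8(.5)^{\,n-1} \,\bigr]
\qquad (n \ge 1)
\]
\[
v_1 = [\,0 \;\; .8 \;\; .2\,], \quad
v_2 = [\,0 \;\; .4 \;\; .6\,], \quad
v_3 = [\,0 \;\; .2 \;\; .8\,], \quad
v_4 = [\,0 \;\; .1 \;\; .9\,]
\]
```

The gg entry therefore approaches 1 as n grows, which is the sense in which the G allele eventually dies out.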


6.3.1. A Fixed Initial State 


The case in which the initial state \(X_0\) is fixed or known rather than random can be
handled by forming a “degenerate” initial probability distribution.


Example 6.17 (Example 6.15 continued) Suppose that our taxi driver lives in 
zone 3 and always goes home at night, which means that he starts each new day 
in zone 3. Starting with certainty in zone 3 means that \(P(X_0 = 3) = 1\), while
\(P(X_0 = 1) = P(X_0 = 2) = 0\). Written as a pmf, the distribution of \(X_0\) is


i               1    2    3
P(X_0 = i)      0    0    1


Equivalently, the probability vector for \(X_0\) is \(v_0 = [0 \;\; 0 \;\; 1]\). From the original
description of the Markov chain (Example 6.1), the initial state being zone 3 implies
that \(X_1 = 1\) with probability .4, \(X_1 = 2\) with probability .4, and \(X_1 = 3\) with
probability .2. This same result can be obtained by applying the theorem of this section:

\[
v_1 = v_0P = [0 \;\; 0 \;\; 1]
\begin{bmatrix} .3 & .2 & .5 \\ .1 & .8 & .1 \\ .4 & .4 & .2 \end{bmatrix}
= [.4 \;\; .4 \;\; .2]
\]

Notice that left-multiplying P by the vector [0 0 1] simply extracts the third row
of P. Similarly, the pmf of \(X_5\), the destination of the fifth passenger, is given by

\[
v_5 = v_0P^5 = [0 \;\; 0 \;\; 1]
\begin{bmatrix} .3 & .2 & .5 \\ .1 & .8 & .1 \\ .4 & .4 & .2 \end{bmatrix}^{5}
= [0 \;\; 0 \;\; 1]
\begin{bmatrix} .2115 & .5767 & .2118 \\ .1938 & .6125 & .1938 \\ .2073 & .5858 & .2070 \end{bmatrix}
= [.2073 \;\; .5858 \;\; .2070]
\]


The matrix \(P^5\) was computed by Matlab. The row vector \(v_5\) is simply the third
row of \(P^5\), because the chain begins in zone 3 with probability 1. So, starting the day
at home in zone 3, the taxi driver finds himself in zone 1, 2, or 3 after five fares with
probabilities .2073, .5858, and .2070, respectively. ■

Example 6.18 (Example 6.14 continued) As before, we can use a high power of
the one-step transition matrix, say \(P^{75}\), to approximate the long-term behavior of
our Gambler’s Ruin Markov chain. Suppose as before that p = .55 and Allan’s
initial stake is $2. We can express the latter as \(v_0 = [0 \;\; 0 \;\; 1 \;\; 0]\); recall that the states, in
order, are $0, $1, $2, $3. Then the probability distribution of \(X_{75}\) is

\[
v_{75} = v_0P^{75} = [0 \;\; 0 \;\; 1 \;\; 0]\,P^{75}
= \text{the third (i.e., \$2) row of } P^{75}
= [.2691 \;\; .0000 \;\; 0 \;\; .7309]
\]


If Allan begins the competition with $2 (and Beth with $1), there is a 73.09% 
chance he will end up with all the money within 75 games, and a 26.91% chance he 
will end up broke after 75 games. As discussed previously, the competition will 
almost certainly end long before a 75th game, but for purposes of forecasting long- 
run behavior we imagine that when either player goes broke, game-play continues 
but no further money is exchanged. 

Suppose instead that Allan’s initial stake is just $1, while Beth starts with $2.
Then Allan’s initial “distribution” is specified by \(v_0 = [0 \;\; 1 \;\; 0 \;\; 0]\), meaning
\(P(X_0 = \$1) = 1\) while \(P(X_0 = \$0, \$2, \$3) = 0\). After 75 plays, we now have

\[
v_{75} = v_0P^{75} = [0 \;\; 1 \;\; 0 \;\; 0]\,P^{75}
= \text{the second (i.e., \$1) row of } P^{75}
= [.5980 \;\; 0 \;\; .0000 \;\; .4020]
\]

Starting with $1, Allan has a 40.2% chance of winning the competition (i.e.,
ending up with $3) and a 59.8% chance of being “ruined.” ■
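
A fixed initial state can be handled in code by starting from a vector with a single 1. The sketch below (plain Python) does this for the taxi chain of Example 6.17 and for Gambler’s Ruin. The Gambler’s Ruin matrix is reconstructed here from the description in the text (states $0 to $3, p = .55, absorbing states $0 and $3) and should be read as an assumption; the function names are illustrative.

```python
def step(v, P):
    """Return the row vector v P (one transition of the chain)."""
    s = len(P)
    return [sum(v[i] * P[i][j] for i in range(s)) for j in range(s)]


def distribution_from_state(P, start, n):
    """Distribution of X_n when X_0 = start with certainty (0-based index)."""
    v = [1.0 if i == start else 0.0 for i in range(len(P))]
    for _ in range(n):
        v = step(v, P)
    return v


taxi = [[0.3, 0.2, 0.5],
        [0.1, 0.8, 0.1],
        [0.4, 0.4, 0.2]]

p = 0.55  # assumed: probability that Allan wins any single game
ruin = [[1, 0, 0, 0],          # $0 is absorbing
        [1 - p, 0, p, 0],      # from $1: lose to $0 or win to $2
        [0, 1 - p, 0, p],      # from $2: lose to $1 or win to $3
        [0, 0, 0, 1]]          # $3 is absorbing

v5 = distribution_from_state(taxi, start=2, n=5)       # start in zone 3
from_2 = distribution_from_state(ruin, start=2, n=75)  # Allan starts with $2
from_1 = distribution_from_state(ruin, start=1, n=75)  # Allan starts with $1

print([round(x, 4) for x in v5])      # approximately [.2073, .5858, .2070]
print([round(x, 4) for x in from_2])  # approximately [.2691, 0, 0, .7309]
print([round(x, 4) for x in from_1])  # approximately [.5980, 0, 0, .4020]
```

Each result is the corresponding row of the matrix power, obtained without computing the other rows.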


6.3.2 Exercises: Section 6.3 (23-30) 


23. Refer back to Exercise 1 of this chapter. Suppose that Negotiator A goes first 
and that 75% of the time she begins negotiations with a cooperative strategy. 
(Consider this to be time index 0.) 

(a) Determine the (unconditional) probability that Negotiator B’s first strategy 
will also be cooperative. 

(b) Determine the (unconditional) probability that Negotiator B’s second strategy 
will be cooperative. [Hint: Which time index corresponds to his second move? ] 

24. Refer back to the Ehrenfest chain model described in Exercise 2 with m = 3
balls. The possible states of the chain \(X_n\) = number of balls in the left chamber
after the nth exchange are {0, 1, 2, 3}.

(a) Suppose that all four possible initial states are equally likely. Determine the 
probability distributions of X, and X5. 

(b) Suppose instead that each of the three balls is initially equally likely to be 
placed in the left or right chamber. In this situation, what is the initial 
distribution of the chain? 

(c) Using the initial distribution specified in (b), determine the unconditional
distributions of \(X_1\) and \(X_2\). What do you notice?

25. Information bits (0s and 1s) in a binary communication system travel through a
long series of relays. At each relay, a “bit-switching” error might occur.
Suppose that at each relay, there is a 4% chance of a 0 bit being switched to
a 1 bit and a 5% chance of a 1 becoming a 0. Let \(X_0\) = bit’s initial parity (0 or
1), and let \(X_n\) = the bit’s parity after traversing the nth relay.

(a) Construct the one-step transition matrix for this chain. [Hint: There are only
two states, 0 and 1.]

(b) Suppose the input stream to this relay system consists of 80% 0s and 20%
1s. Determine the proportions of 0s and 1s exiting the first relay.

(c) Under the same conditions as (b), determine the proportions of 0s and 1s
exiting the fifth relay.

26. Refer to the genetic recombination scenario of Example 6.16. Suppose that
plants will now be cross-bred with known hybrids (i.e., those with genotype
Gg). Mendel’s laws imply the following transition matrix for such breeding:

\[
P = \begin{array}{c|ccc}
 & GG & Gg & gg \\ \hline
GG & .5 & .5 & 0 \\
Gg & .25 & .5 & .25 \\
gg & 0 & .5 & .5
\end{array}
\]

Again assume the initial population genotype distribution of plants to be
cross-bred with these hybrids is 70% GG, 20% Gg, and 10% gg.

(a) Determine the genotype distribution of the first generation of plants
resulting from this cross-breeding experiment.

(b) Determine the genotype distributions of the second, third, and fourth
generations.

27. Refer to the weather scenario described in Example 6.6 and Example 6.10.
Suppose today’s weather forecast for New York City gives a 20% chance of
experiencing a snowy day.

(a) Let \(X_0\) denote today’s weather condition. Express the information provided
as an initial probability vector for \(X_0\).

(b) Determine the (unconditional) likelihoods of a snowy day and a green day
tomorrow, using the one-step transition probabilities specified in Example
6.6.

(c) Based on today’s forecast and the transition probabilities, what is the
chance New York City will experience a “green day” 1 week (7 days)
from now?

28. The article “Option Valuation Under a Multivariate Markov Chain Model”
(Third International Joint Conference on Computational Science and
Optimization, 2010) includes information on the dynamic movement of certain assets
between three states: (1) up, (2) middle, and (3) down. For a particular class of
assets, the following one-step transition probabilities were estimated from
available data:

\[
P = \begin{bmatrix}
.4069 & .3536 & .2395 \\
.3995 & .5588 & .0417 \\
.5642 & .0470 & .3888
\end{bmatrix}
\]

Suppose that the initial valuation of this asset class found that 31.4% of such
assets were in the “up” dynamic state, 40.5% were “middle,” and the remainder
were “down.”

(a) What is the initial probability vector for this chain?

(b) Determine the unconditional probability distribution of \(X_1\), the asset
dynamic state one time step after the initial valuation.

(c) Determine the unconditional probability distribution of \(X_2\), the asset
dynamic state two time steps after the initial valuation.

29. Refer back to Exercise 23, and now suppose that Negotiator A always opens
talks with a competitive strategy.

(a) What is the probability vector for \(X_0\), Negotiator A’s initial strategy?

(b) Without performing any matrix computation, determine the distribution of
\(X_1\), Negotiator B’s first strategy choice.

(c) What is the probability Negotiator A’s second strategy is cooperative?
competitive?

30. Transitions between sleep stages are described in the article “Multinomial
Logistic Estimation of Markov-Chain Models for Modeling Sleep Architecture
in Primary Insomnia Patients” (J. Pharmacokinet. Pharmacodyn., 2010: 137-155).
The following one-step transition probabilities for the five stages awake
(AW), stage 1 sleep (ST1), stage 2 sleep (ST2), slow-wave sleep (SWS), and
rapid-eye movement sleep (REM) were obtained from a graph in the article:

\[
P = \begin{array}{c|ccccc}
 & AW & ST1 & ST2 & SWS & REM \\ \hline
AW  & .90 & .09 & .01 & .00 & .00 \\
ST1 & .21 & .40 & .34 & .02 & .03 \\
ST2 & .02 & .02 & .84 & .09 & .03 \\
SWS & .02 & .02 & .22 & .72 & .02 \\
REM & .04 & .04 & .05 & .00 & .87
\end{array}
\]

The time index of the Markov chain corresponds to half-hour intervals (i.e.,
n = 1 is 30 min after the beginning of the study, n = 2 is 60 min in, etc.).
Initially, all patients in the study were awake.

(a) Let \(v_0\) denote the probability vector for \(X_0\), the initial state of a patient in the
sleep study. Determine \(v_0\).

(b) Without performing any matrix computations, determine the distribution of
patients’ sleep states 30 min (one time interval) into the study.

(c) Determine the distribution of patients’ sleep states 4 h into the study. [Hint:
What time index corresponds to the 4-h mark?]

6.4 Regular Markov Chains and the Steady-State Theorem


In previous sections, we have alluded to the long-term behavior of certain Markov 
chains. In some cases, such as Gambler’s Ruin, we anticipate that the chain will 
eventually reach, and remain in, one of several “absorbing” states (we'll discuss 
these in Sect. 6.5). Our taxi driver, in contrast, should continually move around, but 
perhaps something can be said about how much time he will spend in each of the 
three zones over the course of many, many fares. It turns out that the taxi driver 
example belongs to a special class of Markov chains, called regular chains, for 
which the long-run behavior “stabilizes” in some sense and can be determined 
analytically. 


6.4.1 Regular Chains 


DEFINITION 
A finite-state Markov chain with one-step transition matrix P is said to be a
regular chain if there exists a positive integer n such that all of the entries of
the matrix \(P^n\) are positive.

In other words, for a regular Markov chain there is some positive integer
n such that every state can be reached from every state (including itself) in
exactly n steps.


It’s straightforward to show that if all the entries of \(P^n\) are positive, then so are all of
the entries of \(P^{n+1}\), \(P^{n+2}\), and so on (Exercise 37). Our taxi driver example is a
regular chain, since all nine entries of P itself are positive. The next example shows
that a regular chain may have some one-step transition probabilities equal to zero.


Example 6.19 Internet users’ browser histories can be modeled as Markov chains, 
where the “states” are different Web sites (or classes of Web sites) and transitions 
occur when users move from one Web site to another. The article “Evaluating 
Variable-Length Markov Chain Models for Analysis of User Web Navigation 
Sessions” (IEEE Trans. Knowl. Data Engr., 2007: 441-452) discusses increasingly
complex models of this type. Suppose for simplicity that Web sites are grouped into 
five categories: (1) social media, (2) e-mail, (3) news and sports, (4) online retailers, 
and (5) other (use your imagination). Consider a Markov chain model for users’ 
transitions between these five categories whose state diagram is depicted in Fig. 6.5. 

Fig. 6.5 State diagram for Example 6.19

Notice that, according to this model, not every state can access all five states in
one step, because many one-step transition probabilities are zero. The one-step
transition matrix P of this Markov chain is as follows:

\[
P = \begin{bmatrix}
0 & .3 & 0 & .7 & 0 \\
.2 & .1 & .6 & 0 & .1 \\
0 & 0 & .2 & .4 & .4 \\
.7 & 0 & .1 & .2 & 0 \\
0 & 0 & .5 & .5 & 0
\end{bmatrix}
\]

Eleven of the twenty-five entries in P are zero. However, consider several higher
powers of this matrix:

\[
P^2 = \begin{bmatrix}
.55 & .03 & .25 & .14 & .03 \\
.02 & .07 & .23 & .43 & .25 \\
.28 & 0 & .28 & .36 & .08 \\
.14 & .21 & .04 & .57 & .04 \\
.35 & 0 & .15 & .30 & .20
\end{bmatrix},
\qquad
P^3 = \begin{bmatrix}
.104 & .168 & .097 & .528 & .103 \\
.315 & .013 & .256 & .317 & .099 \\
.252 & .084 & .132 & .420 & .112 \\
.441 & .063 & .211 & .248 & .037 \\
.210 & .105 & .160 & .465 & .060
\end{bmatrix}
\]

Since every entry of \(P^3\) is positive, by definition we have a regular Markov chain.
Every state can reach every state (including itself) in exactly three moves. ■


In contrast, Gambler’s Ruin is not a regular Markov chain. It is not possible for
Allan to go from $2 to $1 in an even number of moves, so the ($2, $1) entry of \(P^n\) is
zero whenever n is even. Similarly, Allan cannot go from $2 back to $2 in an odd
number of steps, so the ($2, $2) entry of \(P^n\) equals zero for every odd exponent n.
Thus, there exists no positive integer n for which all sixteen entries of \(P^n\) are
positive. (In fact, six other entries of \(P^n\) must always be 0: \(P^{(n)}(0 \to j) = 0\) for
j ≠ 0 and \(P^{(n)}(3 \to j) = 0\) for j ≠ 3, since both $0 and $3 are “absorbing” states.)
Another non-regular Markov chain, one that does not have any absorbing states, is
given in the following example.


Example 6.20 Unlike our taxi driver, bus drivers follow a well-defined route.
Consider a bus route from campus (state 1), to the nearby student housing complex
(state 2), to downtown (state 3), and then back to campus. The associated Markov
chain cycles endlessly: 1 → 2 → 3 → 1 → 2 → 3 → 1 .... Figure 6.6 shows the
corresponding state diagram.

Fig. 6.6 State diagram for Example 6.20

The one-step transition matrix for this chain is

\[
P = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}
\]

Direct computation shows that

\[
P^2 = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
\quad \text{and} \quad
P^3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = I,
\]

where I denotes the 3 × 3 identity matrix. Hence, \(P^4 = P^3P = IP = P\);
\(P^5 = P^3P^2 = IP^2 = P^2\); \(P^6 = P^3P^3 = I \cdot I = I\); and so on. That is, the n-step transition
matrix \(P^n\) equals one of P, \(P^2\), or I for every positive integer n, and all three of these
contain some zero entries. Therefore, this is not a regular Markov chain. ■
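
Checking regularity by hand means inspecting successive powers of P for zero entries, which is easy to automate. The following Python sketch tracks only which entries are positive, so no floating-point issues arise. The function name `smallest_positive_power` is hypothetical, and the search limit (s − 1)² + 1 is a classical bound for matrices of this kind (usually attributed to Wielandt) that is not taken from this text.

```python
def smallest_positive_power(P):
    """Return the smallest n with every entry of P^n positive, or None.

    Only the zero/nonzero pattern of P is used. For an s-state chain it is
    enough to search n = 1, ..., (s - 1)**2 + 1 (assumed bound, see lead-in).
    """
    s = len(P)
    pattern = [[P[i][j] > 0 for j in range(s)] for i in range(s)]
    current = pattern  # positivity pattern of P^n, starting with n = 1
    for n in range(1, (s - 1) ** 2 + 2):
        if all(all(row) for row in current):
            return n
        # Pattern of P^(n+1): entry (i, j) is positive when some k links i to j.
        current = [[any(current[i][k] and pattern[k][j] for k in range(s))
                    for j in range(s)] for i in range(s)]
    return None


web = [[0, .3, 0, .7, 0],
       [.2, .1, .6, 0, .1],
       [0, 0, .2, .4, .4],
       [.7, 0, .1, .2, 0],
       [0, 0, .5, .5, 0]]
bus = [[0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]]

print(smallest_positive_power(web))  # 3
print(smallest_positive_power(bus))  # None
```

A return value of `None` indicates that the chain is not regular, as with the bus route, whose powers cycle among P, P², and I.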


6.4.2 The Steady-State Theorem 


What’s so special about regular chains? The transition matrices of regular Markov 
chains exhibit a rather interesting property. Consider a very high power of the 
transition matrix for our taxi driver, computed with the aid of Matlab: 


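\[
P = \begin{bmatrix} .3 & .2 & .5 \\ .1 & .8 & .1 \\ .4 & .4 & .2 \end{bmatrix}
\;\Rightarrow\;
P^{100} = \begin{bmatrix} .2000 & .6000 & .2000 \\ .2000 & .6000 & .2000 \\ .2000 & .6000 & .2000 \end{bmatrix}
\]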

Notice that every row of \(P^{100}\) is identical: roughly, each one is [.2 .6 .2]. What’s
more, raising P to even higher powers yields the same matrix to several decimal
places. That is, \(P^{101}\), \(P^{102}\), and so on are all roughly equal to \(P^{100}\). Something
similar occurs for the regular Markov chain of Example 6.19:

\[
P = \begin{bmatrix}
0 & .3 & 0 & .7 & 0 \\
.2 & .1 & .6 & 0 & .1 \\
0 & 0 & .2 & .4 & .4 \\
.7 & 0 & .1 & .2 & 0 \\
0 & 0 & .5 & .5 & 0
\end{bmatrix}
\;\Rightarrow\;
P^{100} = \begin{bmatrix}
.2844 & .0948 & .1659 & .3791 & .0758 \\
.2844 & .0948 & .1659 & .3791 & .0758 \\
.2844 & .0948 & .1659 & .3791 & .0758 \\
.2844 & .0948 & .1659 & .3791 & .0758 \\
.2844 & .0948 & .1659 & .3791 & .0758
\end{bmatrix}
\]

Again, every row of \(P^{100}\) is the same, and replacing 100 by an even higher power
gives the same result (i.e., to several decimal places \(P^{100} = P^{101} = P^{102} = \cdots\)).
These are two examples of the central result in the theory of Markov chains, the
so-called Steady-State Theorem.


STEADY-STATE THEOREM 
Let P be the one-step transition matrix of a finite-state, regular Markov chain.
Then the matrix limit

\[
\Pi = \lim_{n \to \infty} P^n \tag{6.4}
\]

exists. Moreover, the rows of the limiting matrix \(\Pi\) are identical, with all
positive entries.


The proof of the Steady-State Theorem is beyond the scope of this book;
interested readers may consult the text by Karlin and Taylor listed in the references.

If we let \(\pi = [\pi_1 \cdots \pi_s]\) denote each of the identical rows of the limiting matrix \(\Pi\)
in Eq. (6.4), \(\pi\) is called the steady-state distribution of the Markov chain. Thus, for
the taxi driver example, the steady-state distribution is \(\pi = [.2 \;\; .6 \;\; .2]\), while the
steady-state distribution for the Web browsing Markov chain in Example 6.19 is
\(\pi = [.2844 \;\; .0948 \;\; .1659 \;\; .3791 \;\; .0758]\).
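
The convergence described by the theorem can be watched numerically. The sketch below is plain Python, with `mat_pow` and `row_spread` as illustrative helper names introduced here. It raises the taxi matrix to several powers using repeated squaring and reports how far apart the rows still are.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    s = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(s)) for j in range(s)]
            for i in range(s)]


def mat_pow(P, n):
    """P^n for n >= 1, using repeated squaring."""
    result, base = None, P
    while n > 0:
        if n & 1:
            result = base if result is None else mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


def row_spread(M):
    """Largest difference between two entries in the same column of M."""
    return max(max(col) - min(col) for col in zip(*M))


taxi = [[0.3, 0.2, 0.5],
        [0.1, 0.8, 0.1],
        [0.4, 0.4, 0.2]]

for n in (5, 20, 100):
    print(n, row_spread(mat_pow(taxi, n)))  # spread shrinks toward 0

P100 = mat_pow(taxi, 100)
print([round(x, 4) for x in P100[0]])  # [0.2, 0.6, 0.2]
```

A row spread that is numerically zero is a practical signal that the displayed power already agrees with the limiting matrix to working precision.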

A Markov chain does not have to be regular for the limit of \(P^n\) to exist as n → ∞.
For example, computing progressively larger powers of the one-step transition
matrix for the Gambler’s Ruin scenario of Example 6.14 shows that, for large n,

\[
P^n \approx \begin{bmatrix}
1 & 0 & 0 & 0 \\
.5980 & 0 & .0000 & .4020 \\
.2691 & .0000 & 0 & .7309 \\
0 & 0 & 0 & 1
\end{bmatrix}
\]

That is, the limit of \(P^n\) exists and is, at least to four decimal places, equal to the
matrix displayed above. However, unlike in the case of a regular Markov chain, the
rows of this limiting matrix are not identical and the matrix includes several zeros.
We will consider in more detail Markov chains of this type in the next section.

The transition matrix of a “periodic” Markov chain, such as the one in Example
6.20, does not have a limit. This is not surprising, since periodic functions in general
do not have long-run limits but rather cycle through their possible values.

6.4.3 Interpreting the Steady-State Distribution


The steady-state distribution \(\pi\) of a regular Markov chain can be interpreted in
several ways. We present four different interpretations here; verifications of the
second and fourth statements can be found in the Karlin and Taylor text.

1. If the “current” state of the Markov chain is observed after a large number of
transitions, there is an approximate probability \(\pi_j\) of the chain being in state j.
That is, for large n, \(P(X_n = j) \approx \pi_j\). Moreover, this holds regardless of the
initial distribution of the chain (i.e., the unconditional distribution of the initial
state \(X_0\)).

The first sentence is essentially the definition of \(\pi\) stemming from the
Steady-State Theorem.

2. The long-run proportion of time the Markov chain visits the jth state is \(\pi_j\).

To be more precise, for any state j let \(N_j(n)\) denote the number of times the chain
visits state j in its first n transitions; that is,

\[
N_j(n) = \#\{1 \le k \le n : X_k = j\}
\]

Then it can be shown that \(N_j(n)/n\), the proportion of time the Markov chain
spends in state j among the first n transitions, converges in probability to \(\pi_j\).

3. If we assign \(\pi\) to be the initial distribution of \(X_0\), then the distribution of \(X_n\) is
also \(\pi\) for any subsequent number of transitions n. For this reason, \(\pi\) is
customarily referred to as the stationary distribution of the Markov chain.

To prove Statement 3, first let \(\Pi\) denote the matrix in Eq. (6.4), each of whose
rows is \(\pi\). Now write \(P^{n+1} = P^nP\) and take the limit of both sides as n → ∞:

\[
P^{n+1} = P^nP
\;\Rightarrow\;
\lim_{n \to \infty} P^{n+1} = \lim_{n \to \infty} \left[P^nP\right] = \left[\lim_{n \to \infty} P^n\right] P
\;\Rightarrow\;
\Pi = \Pi P
\]

Each side of the last equation is an s × s matrix; equating the top rows of these
two matrices, we have \(\pi = \pi P\). (You could just as well equate any other row, since
all the rows of \(\Pi\) are the same.)

Now, assign the steady-state distribution to \(X_0\): \(v_0 = \pi\). Then the (unconditional)
distribution of \(X_1\), using the results of Sect. 6.3, is \(v_1 = \pi P\), which we have
established equals \(\pi\). Continuing by induction, we have for any n that the
unconditional distribution of \(X_n\) is \(v_n = v_{n-1}P = \pi P = \pi\), completing the proof.

4. The expected number of transitions required to return to the jth state, beginning
in the jth state, is equal to \(1/\pi_j\). This is called the mean recurrence time for
state j.

Compare this result to the mean of a geometric rv from Chap. 2: the expected
number of trials (replications) required to first observe an event whose probability is
p equals 1/p. The difference is that the geometric model assumes the trials are
independent, while a Markov chain model assumes that successive states of the
chain are dependent (as specified by the Markov property). But if we think of
“return to the jth state” as our event of interest, then Statement 1 implies that
(at least for large n) the probability of this event is roughly \(\pi_j\), and so it seems
reasonable that the average number of tries/steps it will take to achieve this event
will be \(1/\pi_j\).


Example 6.21 The steady-state distribution of the taxi driver example is the 1 × 3
vector \(\pi = [.2 \;\; .6 \;\; .2]\). For now, this relies on the computation of \(P^{100}\) above; shortly,
we will present a derivation of this vector that does not require raising P to a high
power. From the preceding descriptions, we conclude all of the following:

1. Regardless of where the taxi driver starts his day, for large n there is about a 20%
chance his nth fare will be dropped off in zone 1, a 60% chance that that fare
will go to zone 2, and a 20% chance for zone 3.

2. In the long run, the taxi driver drops off about 20% of his fares in zone 1, about
60% in zone 2, and about 20% in zone 3.

3. Suppose the taxi driver sleeps in his cab, thus waking up each day in a “random”
zone, and we assign to \(X_0\) (his point of origin tomorrow, say) the initial
distribution \(v_0 = \pi = [.2 \;\; .6 \;\; .2]\). The unconditional distribution of \(X_1\), the
destination of tomorrow’s first fare, is

\[
v_1 = v_0P = [.2 \;\; .6 \;\; .2]
\begin{bmatrix} .3 & .2 & .5 \\ .1 & .8 & .1 \\ .4 & .4 & .2 \end{bmatrix}
\]

By direct computation, the first entry of \(v_1\) is (.2)(.3) + (.6)(.1) + (.2)(.4) = .2;
the second entry is (.2)(.2) + (.6)(.8) + (.2)(.4) = .6; and the last is .2. That is,
\(v_1 = [.2 \;\; .6 \;\; .2] = \pi\), and so \(X_1\) has the same distribution as \(X_0\). The same will hold
for \(X_2\), \(X_3\), and so on.

4. If the driver starts from his home in zone 3, then on the average the number of
fares he handles until he is brought back to zone 3 is given by \(1/\pi_3 = 1/(.2) = 5\).
That is, the mean recurrence time for state 3 (zone 3) is five transitions. ■
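
Interpretations 2 and 4 describe what a single long trajectory of the chain should look like, so they can be explored by simulation. The following is a rough Python sketch using only the standard library. The seed, the run length, and the function name `simulate` are arbitrary choices made here, and the printed values will vary slightly with them.

```python
import random

taxi = [[0.3, 0.2, 0.5],
        [0.1, 0.8, 0.1],
        [0.4, 0.4, 0.2]]


def simulate(P, start, n_steps, rng):
    """Return one simulated path X_1, ..., X_n (states are 0-based indices)."""
    states = range(len(P))
    path, current = [], start
    for _ in range(n_steps):
        # Next state is drawn using the current state's row of P as weights.
        current = rng.choices(states, weights=P[current])[0]
        path.append(current)
    return path


rng = random.Random(2024)
path = simulate(taxi, start=2, n_steps=200_000, rng=rng)  # begin in zone 3

# Interpretation 2: long-run proportion of fares dropped off in each zone.
proportions = [path.count(j) / len(path) for j in range(3)]

# Interpretation 4: average number of fares between successive visits to zone 3.
visits = [0] + [k + 1 for k, x in enumerate(path) if x == 2]
gaps = [b - a for a, b in zip(visits, visits[1:])]
mean_return = sum(gaps) / len(gaps)

print(proportions)  # each entry should be close to .2, .6, .2
print(mean_return)  # should be close to 1/.2 = 5
```

Agreement with [.2 .6 .2] and with a mean return time of 5 is only approximate for any finite run; longer runs tighten it.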


6.4.4 Efficient Computation of Steady-State Probabilities 


The preceding examples of regular Markov chains and the resulting steady-state
distributions may suggest that one determines \(\pi\) by computing a high power of
the transition matrix P, preferably with software, and then extracting any row of the
resulting matrix (all of which will be the same, according to the Steady-State
Theorem). Fortunately there is a more direct technique for determining \(\pi\).
The method was hinted at in the proof of Statement 3 above: the steady-state
distribution \(\pi\) satisfies the matrix equation \(\pi P = \pi\). In fact, something stronger
is true.

THEOREM
Let P be the one-step transition matrix of a regular Markov chain on the state
space {1, ..., s}. The steady-state distribution of the Markov chain is the
unique solution \(\pi = [\pi_1 \cdots \pi_s]\) to the system of equations formed by

\[
\pi P = \pi \quad \text{and} \quad \pi_1 + \cdots + \pi_s = 1 \tag{6.5}
\]


Proof Statement 3 above and the fact that \(\pi\) is a probability vector (because it’s the
limit of probability vectors) ensure that \(\pi\) itself satisfies Eq. (6.5). We must show
that any other vector satisfying both equations in Eq. (6.5) is, in fact, \(\pi\). To that end,
let w be any 1 × s vector satisfying the two conditions \(wP = w\) and \(\sum w_i = 1\).
Similar to earlier derivations, we have \(wP^2 = (wP)P = wP = w\) and, by induction,
\(wP^n = w\) for any positive integer n. Taking the limit of both sides as n → ∞, the
Steady-State Theorem implies that \(w\Pi = w\).
Now expand \(w\Pi\):

\[
\begin{aligned}
w\Pi &= [w_1 \cdots w_s]
\begin{bmatrix} \pi_1 & \cdots & \pi_s \\ \vdots & & \vdots \\ \pi_1 & \cdots & \pi_s \end{bmatrix}
= \bigl[(w_1\pi_1 + \cdots + w_s\pi_1) \;\cdots\; (w_1\pi_s + \cdots + w_s\pi_s)\bigr] \\
&= \bigl[(\textstyle\sum w_i)\pi_1 \;\cdots\; (\sum w_i)\pi_s\bigr]
= (\textstyle\sum w_i)[\pi_1 \cdots \pi_s] = (\sum w_i)\pi
\end{aligned}
\]

Since \(\sum w_i = 1\) by assumption, we have \(w\Pi = \pi\). It was established above that
\(w\Pi = w\), and so we conclude that \(w = \pi\), as originally claimed. ■


Example 6.22 Consider again the Markov chain model for snowy days (S) and
non-snowy or “green” days (G) in New York City, begun in Example 6.6. The
one-step transition matrix was given by

\[
P = \begin{array}{c|cc}
 & G & S \\ \hline
G & .964 & .036 \\
S & .224 & .776
\end{array}
\]

Since all the entries of P are positive, this is a regular Markov chain. The preceding
theorem can be used to determine the steady-state probabilities \(\pi = [\pi_1 \;\; \pi_2]\). The
equations in Eq. (6.5), written out long-hand, are

\[
\begin{aligned}
.964\pi_1 + .224\pi_2 &= \pi_1 \\
.036\pi_1 + .776\pi_2 &= \pi_2 \\
\pi_1 + \pi_2 &= 1
\end{aligned}
\]

Substituting \(\pi_2 = 1 - \pi_1\) into the first equation gives \(.964\pi_1 + .224(1 - \pi_1) = \pi_1\);
solving for \(\pi_1\) produces \(\pi_1 = .224/.260 = .8615\) and then \(\pi_2 = 1 - .8615 = .1385\).
For the season to which this model applies, in the long run New York City has a
green day (less than 50 mm of snow) on 86.15% of days and a snowy day (at least
50 mm) on the other 13.85% of days.

It’s important to note that the top two equations alone, i.e., those provided by the
relationship \(\pi P = \pi\), do not uniquely determine the value of the vector \(\pi\). The first
equation is equivalent to \(.224\pi_2 = .036\pi_1\) (subtract \(.964\pi_1\) from both sides), but so is
the second equation (subtract \(.776\pi_2\) from both sides). The final equation, requiring
the entries of \(\pi\) to sum to 1, is necessary to obtain a unique solution. ■
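
For any two-state chain the same substitution can be carried out once in symbols. As a sketch, write the two off-diagonal transition probabilities as a and b (symbols introduced here for illustration only, each assumed to lie strictly between 0 and 1):

```latex
\[
P = \begin{bmatrix} 1-a & a \\ b & 1-b \end{bmatrix},
\qquad
\pi_1(1-a) + \pi_2 b = \pi_1
\;\Longrightarrow\;
a\,\pi_1 = b\,\pi_2
\]
\[
\pi_1 + \pi_2 = 1
\;\Longrightarrow\;
\pi = \left[\, \frac{b}{a+b} \;\;\; \frac{a}{a+b} \,\right]
\]
\[
a = .036,\; b = .224:
\qquad
\pi = \left[\, \frac{.224}{.260} \;\;\; \frac{.036}{.260} \,\right] \approx [\,.8615 \;\;\; .1385\,]
\]
```

The last line reproduces the values found in Example 6.22, with the first entry belonging to state G and the second to state S.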


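Expression (6.5) may be reexpressed as a single matrix equation. Taking a
transpose,

\[
\pi P = \pi
\;\Rightarrow\;
P^T\pi^T = \pi^T = I\pi^T
\;\Rightarrow\;
(P^T - I)\pi^T = \mathbf{0},
\]

where \(\mathbf{0}\) is an s × 1 vector of zeros. The requirement \(\pi_1 + \cdots + \pi_s = 1\) can be
rendered in matrix form as \([1 \cdots 1]\pi^T = [1]\), and so the system of Eq. (6.5) can be
expressed with the augmented matrix

\[
\left[\begin{array}{c|c}
1 \;\cdots\; 1 & 1 \\
P^T - I & \mathbf{0}
\end{array}\right] \tag{6.6}
\]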


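Example 6.23 (Example 6.21 continued) To analytically determine the steady-
state distribution of our taxi driver example, first construct the matrix $\mathbf{P}^T - \mathbf{I}$:


$$\mathbf{P}^T - \mathbf{I} = \begin{bmatrix} .3 & .1 & .4 \\ .2 & .8 & .4 \\ .5 & .1 & .2 \end{bmatrix} - \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} -.7 & .1 & .4 \\ .2 & -.2 & .4 \\ .5 & .1 & -.8 \end{bmatrix}$$


Second, form the augmented matrix indicated in Expression (6.6), and then
finally use Gauss-Jordan elimination to find its reduced row echelon form (e.g.,
with the rref command in Matlab):


$$\left[\begin{array}{ccc|c} -.7 & .1 & .4 & 0 \\ .2 & -.2 & .4 & 0 \\ .5 & .1 & -.8 & 0 \\ 1 & 1 & 1 & 1 \end{array}\right] \;\xrightarrow{\text{rref}}\; \left[\begin{array}{ccc|c} 1 & 0 & 0 & .2 \\ 0 & 1 & 0 & .6 \\ 0 & 0 & 1 & .2 \\ 0 & 0 & 0 & 0 \end{array}\right]$$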


From the right-hand matrix, we infer that $\pi_1 = .2$, $\pi_2 = .6$, and $\pi_3 = .2$. This
matches our earlier deduction from the matrix $\mathbf{P}^{10}$. ■
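
For readers without access to Matlab, the procedure of Expression (6.6) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production routine: the function names `rref` and `steady_state` are our own, exact fractions are used to avoid rounding, and the code assumes the system has a unique solution (as it does for a regular chain). The matrix P is read off by transposing the matrix $\mathbf{P}^T$ shown in Example 6.23.

```python
from fractions import Fraction

def rref(M):
    """Return the reduced row echelon form of M (a list of rows of Fractions)."""
    M = [row[:] for row in M]
    rows, cols = len(M), len(M[0])
    pivot_row = 0
    for col in range(cols):
        if pivot_row == rows:
            break
        # look for a row at or below pivot_row with a nonzero entry in this column
        pr = next((r for r in range(pivot_row, rows) if M[r][col] != 0), None)
        if pr is None:
            continue
        M[pivot_row], M[pr] = M[pr], M[pivot_row]
        pivot = M[pivot_row][col]
        M[pivot_row] = [x / pivot for x in M[pivot_row]]
        for r in range(rows):
            if r != pivot_row and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[pivot_row])]
        pivot_row += 1
    return M

def steady_state(P):
    """Solve pi P = pi with sum(pi) = 1 via the augmented matrix of Expression (6.6)."""
    s = len(P)
    # rows of [P^T - I | 0]: entry (i, j) of P^T is P[j][i]
    aug = [[P[j][i] - (1 if i == j else 0) for j in range(s)] + [Fraction(0)]
           for i in range(s)]
    aug.append([Fraction(1)] * (s + 1))        # the row [1 ... 1 | 1]
    R = rref(aug)
    return [R[i][s] for i in range(s)]         # last column of the first s rows

F = Fraction
P = [[F(3, 10), F(2, 10), F(5, 10)],
     [F(1, 10), F(8, 10), F(1, 10)],
     [F(4, 10), F(4, 10), F(2, 10)]]

pi = steady_state(P)
print([float(x) for x in pi])   # [0.2, 0.6, 0.2]
```

Working with `Fraction` objects keeps every row operation exact, so the zero row at the bottom of the reduced matrix is exactly zero rather than a small rounding residue.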


Example 6.24 For the Internet browser scenario of Example 6.19, the steady-state
distribution can be determined as follows:

$$\left[\begin{array}{c|c} \mathbf{P}^T - \mathbf{I} & \mathbf{0} \\ \hline 1 \ \cdots \ 1 & 1 \end{array}\right] \;\xrightarrow{\text{rref}}\; \left[\begin{array}{ccccc|c} 1 & 0 & 0 & 0 & 0 & .2840 \\ 0 & 1 & 0 & 0 & 0 & .0948 \\ 0 & 0 & 1 & 0 & 0 & .1659 \\ 0 & 0 & 0 & 1 & 0 & .3791 \\ 0 & 0 & 0 & 0 & 1 & .0758 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{array}\right]$$


That is, $\pi_1 = .2840$, $\pi_2 = .0948$, and so on; these match the results suggested
earlier by considering a high power of P. In the long run, about 28.40% of Web pages
visited by Internet users under consideration are social media sites, 9.48% are for
checking e-mail, 16.59% are news and sports Web sites, etc. Also, when a user
finishes checking her or his e-mail online, the average number of Web sites visited
until s/he checks e-mail again is $1/\pi_2 = 1/.0948 = 10.55$ (including the second login
to e-mail). ■


6.4.5 Irreducible and Periodic Chains 


The existence of a stationary distribution is not unique to regular Markov chains. 


DEFINITION 

Let i and j be two (not necessarily distinct) states of a Markov chain. State j is
accessible from state i (or, equivalently, i can access j) if $P^{(n)}(i \to j) > 0$ for
some integer $n \ge 0$.¹ A Markov chain is irreducible if every state is accessible
from every other state.


It should be clear that every regular chain is irreducible (do you see why?). 
However, the reverse is not true: an irreducible Markov chain need not be a regular 
chain. Consider the cyclic chain of Example 6.20: the bus can access any of the 
three locations it visits (campus, housing, downtown) from any other location, so 
the chain is irreducible. However, as discussed earlier in this section, the chain is 
definitely not regular. The Ehrenfest chain model developed in Exercise 2 is another 
example of an irreducible but not regular chain; see Exercise 43 at the end of this 
section. 

¹ For n = 0, the symbol $P^{(0)}(i \to j)$ is interpreted as the probability of going from i to j in zero steps,
and so necessarily $P^{(0)}(i \to i) = 1$ for all i and $P^{(0)}(i \to j) = 0$ for $i \ne j$. In particular, this means
every state i is, by definition, accessible from itself.

It can be shown that any finite-state, irreducible Markov chain has a stationary
distribution. That is, if P is the transition matrix of an irreducible chain, there exists
a row vector $\boldsymbol{\pi}$ such that $\boldsymbol{\pi}\mathbf{P} = \boldsymbol{\pi}$; moreover, there is a unique such vector satisfying
the additional constraint $\Sigma \pi_i = 1$. For example, the cyclic bus route chain of
Example 6.20 has stationary distribution $\boldsymbol{\pi} = [1/3 \ \ 1/3 \ \ 1/3]$, as seen by the
computation


$$\boldsymbol{\pi}\mathbf{P} = [1/3 \ \ 1/3 \ \ 1/3]\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix} = [1/3 \ \ 1/3 \ \ 1/3] = \boldsymbol{\pi}$$


So, if the bus is equally likely to be at any of its three locations right now, it is
also equally likely to be at any of those three places after the next transition (the
“stationary” interpretation of $\boldsymbol{\pi}$). This is true even though the chain is not regular, so
the Steady-State Theorem does not apply.
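
The contrast between "stationary" and "steady-state" can be seen in a few lines of Python. This is an illustrative sketch only; the helper `step` is our own name, and the matrix is the cyclic one displayed in the computation above. The uniform vector is left unchanged by a transition, whereas a chain started in a single location simply cycles and never settles down.

```python
from fractions import Fraction

def step(dist, P):
    """One transition: multiply the row vector dist by the matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Cyclic three-state chain (matrix as displayed for Example 6.20)
P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]

third = Fraction(1, 3)
pi = [third, third, third]
print(step(pi, P) == pi)      # True: pi is a stationary distribution

dist = [1, 0, 0]              # start in the first state with certainty
for n in range(1, 7):
    dist = step(dist, P)
    print(n, dist)            # the point mass rotates with period 3
```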

If an s-state Markov chain is irreducible but not regular, then every state can
access every other state but there exists no integer n for which all $s^2$ probabilities
$P^{(n)}(i \to j)$ are positive. The only way this can occur is if the chain exhibits some
sort of “periodic” behavior, e.g., when one group of states can access some states
only in an even number of steps and others only in an odd number of steps.
Formally, the period of a state i is defined as the greatest common divisor of all
positive integers n such that $P^{(n)}(i \to i) > 0$; if that gcd equals 1, then state i is called
aperiodic. All three states in the cyclic chain above have period 3, because for every
state the period is gcd(3, 6, 9, ...) = 3. It can be shown that every state in an
irreducible chain has the same period; the chain is called aperiodic if that common
period is 1 and is called periodic otherwise.

As noted previously, for any regular Markov chain there exists an integer n such
that all the entries of $\mathbf{P}^n$, $\mathbf{P}^{n+1}$, $\mathbf{P}^{n+2}$, and so on are positive. Since the gcd of the set
{n, n+1, n+2, ...} is 1, it immediately follows that every regular Markov chain is
aperiodic. The following theorem characterizes regularity for finite-state chains.


THEOREM 
A finite-state Markov chain is regular if, and only if, it is both irreducible and 
aperiodic. 


The “only if” direction of the theorem is established in the earlier paragraphs of
this sub-section. The converse statement, that all irreducible and aperiodic finite-state
chains are regular, can be proved using a result called the Frobenius coin-exchange
theorem (we will not present the proof here).
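
The three notions in this sub-section (accessibility, period, regularity) can be explored computationally. The sketch below is an assumption-laden illustration rather than a definitive test: the function name `analyze` and the cutoff `max_power` are ours, and the periods are computed only from return times up to that cutoff, so the bound must be chosen large enough for the chain at hand. Only the pattern of positive entries matters, so the code works with True/False matrices.

```python
from math import gcd

def bool_mult(A, B):
    """Boolean matrix product: entry (i, j) is True if some k has A[i][k] and B[k][j]."""
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def analyze(P, max_power=50):
    """Return (irreducible, periods, regular), judged from P^1, ..., P^max_power."""
    n = len(P)
    A = [[p > 0 for p in row] for row in P]       # which one-step moves are possible
    power = A                                     # positivity pattern of P^k
    reach = [[i == j for j in range(n)] for i in range(n)]   # the n = 0 convention
    periods = [0] * n                             # gcd(0, k) == k starts the gcd
    regular = False
    for k in range(1, max_power + 1):
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or power[i][j]
            if power[i][i]:                       # a return to i in exactly k steps
                periods[i] = gcd(periods[i], k)
        if all(all(row) for row in power):        # every entry of P^k is positive
            regular = True
        power = bool_mult(power, A)
    irreducible = all(all(row) for row in reach)
    return irreducible, periods, regular

cyclic = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
taxi = [[.3, .2, .5], [.1, .8, .1], [.4, .4, .2]]
print(analyze(cyclic))   # (True, [3, 3, 3], False)
print(analyze(taxi))     # (True, [1, 1, 1], True)
```

Consistent with the theorem, the cyclic chain is irreducible with period 3 and therefore not regular, while the taxi chain is irreducible, aperiodic, and regular.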


6.4.6 Exercises: Section 6.4 (31-43) 


31. Refer back to Mendel’s plant breeding experiments in Example 6.16 and
Exercise 26.
(a) Do the genotypes formed by successive cross-breeding with pure recessive
plants gg, as in Example 6.16, form a regular Markov chain?
(b) Do the genotypes formed by successive cross-breeding with hybrid plants
Gg, as in Exercise 26, form a regular Markov chain?

32. Refer back to Exercise 2. Assume m = 3 balls are being exchanged between the
two chambers. Is the Markov chain $X_n$ = number of balls in the left chamber a
regular chain?

33. Refer back to Example 6.13 regarding cell phone contracts in China.
(a) Determine the steady-state probabilities of this chain.
(b) In the long run, what proportion of Chinese cell phone users will have
contracts with China Mobile?
(c) A certain cell phone customer currently has a contract with China Telecom.
On the average, how many contract changes will s/he make before signing
with China Telecom again?

34. The article “Markov Chain Model for Performance Analysis of Transmitter
Power Control in Wireless MAC Protocol” (Twenty-first International Conference
on Advanced Networking and Applications, 2007) describes a Markov
chain model for the state of a communication channel using a particular
“slotted non-persistent” (SNP) protocol. The channel’s possible states are
(1) idle, (2) successful transmission, and (3) collision. For particular values
of the authors’ proposed four-parameter model, the following transition matrix
results:

$$\mathbf{P} = \begin{bmatrix} .50 & .40 & .10 \\ .02 & .98 & 0 \\ .12 & 0 & .88 \end{bmatrix}$$

(a) Verify that P is the transition matrix of a regular Markov chain.
(b) Determine the steady-state probabilities for this channel.
(c) What proportion of time is this channel idle, in the long run?
(d) What is the average number of time steps between successive collisions?

35. Refer back to Exercise 3.
(a) Construct the one-step transition matrix P of this chain.
(b) Show that $X_n$ = the machine’s state (full, part, broken) on the nth day is a
regular Markov chain.
(c) Determine the steady-state probabilities for this chain.
(d) On what proportion of days is the machine fully operational?
(e) What is the average number of days between breakdowns?

36. Refer back to Exercise 6, and assume three files A, B, C are to be repeatedly
requested. Suppose that 60% of requests are for file A, 10% for file B, and 30%
for C. Let $X_n$ = the stacked order of the files (e.g., ABC) after the nth request.
(a) Construct the transition matrix P for this chain. (The one-step transition
probabilities were established in Exercise 6(c).)
(b) Determine the steady-state probability for the stack ABC.
(c) Show that, in general, the steady-state probability for ABC is given by

where $p_A$ = P(file A is requested) and $p_B$ and $p_C$ are defined similarly. (The other
five steady-state probabilities can be deduced by changing the subscripts
appropriately.)

37. Let P be the one-step transition matrix of a Markov chain. Show that if all the
entries of $\mathbf{P}^n$ are positive for some positive integer n, then so are all the entries
of $\mathbf{P}^{n+1}$, $\mathbf{P}^{n+2}$, and so on. [Hint: Write $\mathbf{P}^{n+1} = \mathbf{P} \cdot \mathbf{P}^n$ and consider how the (i, j)th
entry of $\mathbf{P}^{n+1}$ is obtained.]

38. Refer back to Exercise 19.
(a) Consider a new customer. By definition, s/he made no visits to the repair
shop last year. What is his/her expected number of visits this year?
(b) Now suppose a car owner has been a customer of this repair shop for many
years. What is the expected number of shop visits s/he will make next year?

39. Consider a Markov chain with just two states, 0 and 1, with one-step transition
probabilities $\alpha = P(0 \to 1)$ and $\beta = P(1 \to 0)$.
(a) Assuming $0 < \alpha < 1$ and $0 < \beta < 1$, determine the steady-state probabilities
of states 0 and 1 in terms of $\alpha$ and $\beta$.
(b) What happens if $\alpha$ and/or $\beta$ equals 0 or 1?

40. Occupational prestige describes how particular jobs are regarded by society
and is often used by sociologists to study class. The article “Social Mobility in
the United States as a Markov Process” (J. for Economic Educators, v. 8
no. 1 (2008): 15-37) investigates the occupational prestige of fathers and
sons. Data provided in the article can be used to derive the following transition
matrix for occupational prestige classified as low (L), medium (M), or high (H):

$$\mathbf{P} = \begin{array}{c} L \\ M \\ H \end{array}\begin{bmatrix} .5288 & .2096 & .2616 \\ .3688 & .2530 & .3782 \\ .2312 & .1738 & .5950 \end{bmatrix}$$

(a) Which occupational prestige “state” is the most likely to self-replicate (i.e.,
father and son are in the same category)? Which is the least likely?
(b) Determine the steady-state distribution of this Markov chain.
(c) Interpret the distribution in (b), assuming the model specified by the matrix
is valid across many generations.
[Note: The authors actually used 11 categories of occupational prestige; we
have collapsed these into three categories for simplicity.]

41. The two ends of a wireless communication system can each be inactive (0) or
active (1). Suppose the two nodes act independently, each as a Markov chain
with the transition probabilities specified in Exercise 39. Let $X_n$ = the “combined”
state of the two relays at the nth time step. The state space for this chain
is {00, 01, 10, 11}, e.g., state 01 corresponds to an inactive transmitter with an
active receiver. (Performance analysis of such systems is described in “Energy-
Efficient Markov Chain-Based Duty Cycling Schemes for Greener Wireless
Sensor Networks,” ACM J. on Emerging Tech. in Computing Systems
(2012):1-32.)
(a) Determine the transition matrix for this chain. [Hint: Use independence to
uncouple the two states, e.g., $P(00 \to 10) = P(0 \to 1) \cdot P(0 \to 0)$.]
(b) Determine the steady state distribution of this chain.
(c) As the authors note, “a connection is feasible only when both wireless nodes are
active.” What proportion of time is a connection feasible under this model?

42. A particular gene has three expressions: AA, Aa, and aa. When two individuals
mate, one half of each parent’s gene is contributed to the offspring (and each
half is equally likely to be donated). For example, an AA mother can only
donate A while an Aa father is equally likely to donate A or a, resulting in a
child that is either AA or Aa. Suppose that the population proportions of AA, Aa,
and aa individuals are p, q, and r, respectively (so p + q + r = 1). Consider the
offspring of a randomly selected individual; specifically, let $X_n$ = the gene
expression of the oldest child in his or her nth generation of descendants
(whom we assume will have at least one offspring).
(a) Assume the nth-generation individual’s mate is selected at random from the
genetic population described above. Show that $P(X_{n+1} = AA \mid X_n = AA) = p + q/2$,
$P(X_{n+1} = Aa \mid X_n = AA) = q/2 + r$, and $P(X_{n+1} = aa \mid X_n = AA) = 0$. [Hint:
Apply the Law of Total Probability.]
(b) Using the same method as in (a), determine the other one-step transition
probabilities and construct the transition matrix P of this chain.
(c) Verify that $X_n$ is a regular Markov chain.
(d) Suppose there exists some $\alpha \in [0, 1]$ such that $p = \alpha^2$, $q = 2\alpha(1 - \alpha)$, and
$r = (1 - \alpha)^2$. (In this context, $\alpha$ = P(A allele).) Show that $\boldsymbol{\pi} = [p \ q \ r]$ is the
stationary distribution of this chain. (This fact is called the Hardy-
Weinberg law; it establishes that the rules of genetic recombination result
in a long-run stable distribution of genotypes.)

43. Refer back to the Ehrenfest chain model of Exercises 2 and 24. Once again
assume that m = 3 balls are being exchanged between the two chambers.
(a) Explain why this is an irreducible chain, but not a regular chain.
(b) Explain why each state has period equal to 2.
(c) Show that the vector [1/8 3/8 3/8 1/8] is a stationary distribution for this
chain. (Thus, even though the chain is not regular and the transition matrix
P does not have a limit, there still exists a stationary distribution due to
irreducibility.)


6.5 Markov Chains with Absorbing States 


The Gambler’s Ruin scenario, begun in Example 6.2, has the feature that the chain 
“terminates” when it reaches either of two states ($0 or $3 in our version of the 
competition). As we’ve noted previously, it’s mathematically advantageous to 
imagine that the Markov chain actually continues in these cases, just never leaving 
the state 0 or 3; one such sample path is

$$2 \to 1 \to 2 \to 1 \to 0 \to 0 \to 0 \to 0$$


In this section, we first investigate states from which a Markov chain can never 
exit and the time it takes to arrive in one of those states. 


DEFINITION 
A state j of a Markov chain is called an absorbing state if


$$P(j \to j) = 1.$$


Equivalently, j is an absorbing state if the (j, j)th entry of the one-step
transition matrix of the chain is 1.


The states 0 and 3 are both absorbing states in our Gambler’s Ruin example. In 
contrast, the taxi driver example has no absorbing states. The next example shows 
that some care must be taken in identifying absorbing states. 


Example 6.25 Anyone who has applied for a bank loan knows that the process of 
eventual approval (or rejection) involves many steps and, occasionally, a lot of 
complex negotiation. Figure 6.7 illustrates the possible route of a set of loan 
documents from (1) document initiation to (6) final approval or rejection. The 
intermediate steps (2)-(5) represent various exchanges between underwriters, 
loan officers, and the like. In this particular chain, two such individuals (at states 
3 and 5) have the authority to make a final decision, though the agent at state 3 may 
elect to return the documents for further discussion. 
The one-step transition matrix of this chain is


$$\mathbf{P} = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \end{array}\begin{bmatrix} 0 & .5 & 0 & .5 & 0 & 0 \\ 0 & 0 & .5 & .5 & 0 & 0 \\ 0 & .5 & 0 & 0 & 0 & .5 \\ 0 & 0 & .5 & 0 & .5 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$


Fig. 6.7 State diagram for Example 6.25


Although the number 1 appears twice in P, only state 6 is an absorbing state of
this chain. Indeed, $p_{66} = P(6 \to 6) = 1$; however, state 5 is not an absorbing state
because $p_{55} = P(5 \to 5) = 0$. Rather, the fifth row of P indicates that if the chain
ever enters state 5, it will necessarily pass in the next transition into state 6 (where,
as it happens, it will be “absorbed”). ■


To be clear, a Markov chain may have no absorbing states (the taxi driver), a single 
absorbing state (Example 6.25), or multiple absorbing states (Gambler’s Ruin). 


6.5.1 Time to Absorption 


When a Markov chain has one or more absorbing states, it is natural to ask how long
it will take to reach an absorbing state. Of course, the answer depends on where
(i.e., in which state) the Markov chain begins. For any non-absorbing state i, define
a random variable $T_i$ by


$T_i$ = number of transitions until the Markov chain reaches an absorbing state,
starting in state i


This rv $T_i$ is called the time to absorption from state i; the possible values of $T_i$
are 1, 2, 3, 4, .... As we shall now illustrate, the distribution of $T_i$ can be approximated
from the k-step transition matrices $\mathbf{P}^k$ for k = 1, 2, 3, .... For simplicity’s sake,
consider first a Markov chain with a single absorbing state, which we will call a.
Then the (i, a)th entry of P is the probability of transitioning directly from state i into
the absorbing state a, which is therefore also the probability that $T_i$ equals 1:


$$P(i \to a) = P(T_i = 1)$$


Since $T_i$ is always a positive integer, this also equals $P(T_i \le 1)$, a fact which will
prove important shortly. Now consider the (i, a)th entry of $\mathbf{P}^2$, which represents
$P^{(2)}(i \to a)$. There are two ways the Markov chain could transition from i to a in two
steps:


$i \to$ any non-absorbing state $\to a$ ($T_i = 2$), or
$i \to a \to a$ ($T_i = 1$)


Therefore, the two-step probability $P^{(2)}(i \to a)$ does not represent the chance $T_i$
equals 2, but rather the chance that $T_i$ is at most 2. That is,


$$P^{(2)}(i \to a) = P(T_i \le 2).$$


Following the same pattern, the k-step transition probability $P^{(k)}(i \to a)$ is equal
to $P(T_i \le k)$ for any positive integer k.

If the Markov chain has two absorbing states, $a_1$ and $a_2$ say, then the chance of
being absorbed from state i in one step is simply the sum $P(i \to a_1) + P(i \to a_2)$, since
those two events are mutually exclusive (you can only arrive in one state). Similarly,
the probability $P(T_i \le 2)$ is determined by adding $P^{(2)}(i \to a_1) + P^{(2)}(i \to a_2)$, and so
on. The general result is stated in the following theorem.


THEOREM 

Consider a finite-state Markov chain, and let A denote the (non-empty) set of
absorbing states. For any state $i \notin A$, define $T_i$ = the number of transitions,
starting in state i, until the chain arrives in some absorbing state. Then the cdf
of $T_i$ is given by


$$F_{T_i}(k) = P(T_i \le k) = \sum_{a \in A} P^{(k)}(i \to a), \qquad k = 1, 2, 3, \ldots$$


In the special case of a single absorbing state, a, this simplifies to


$$F_{T_i}(k) = P(T_i \le k) = P^{(k)}(i \to a)$$


The probability distribution of $T_i$ (i.e., the pmf of the rv $T_i$) can then be
determined from the cdf.


Example 6.26 (Example 6.25 continued) Let’s consider the rv $T_1$, the absorption
time from state 1 (i.e., the number of steps from loan document initiation to the
bank’s final decision). From the one-step transition matrix P, we know that


$$F_{T_1}(1) = P(T_1 \le 1) = P(T_1 = 1) = P(1 \to 6) = p_{16} = 0.$$


The (1,6) entry of $\mathbf{P}^2$ is also zero, so $F_{T_1}(2) = P(T_1 \le 2) = P^{(2)}(1 \to 6) = 0$.
Software was used to obtain the matrices $\mathbf{P}^3, \ldots, \mathbf{P}^{12}$, resulting in the following
values for the (1,6) entry.


k             1    2    3    4      5    6      7      8      9      10     11     12
$F_{T_1}(k)$  0    0    .5   .6875  .75  .8594  .8984  .9336  .9570  .9707  .9810  .9873


The accompanying table is, of course, an incomplete description of the cdf of $T_1$,
since this process could theoretically be continued indefinitely. Next, because the rv
$T_1$ is integer-valued, its pmf is easily determined from the cdf:


$$\begin{aligned} P(T_1 = 2) &= P(T_1 \le 2) - P(T_1 \le 1) = F_{T_1}(2) - F_{T_1}(1) = 0 - 0 = 0 \\ P(T_1 = 3) &= P(T_1 \le 3) - P(T_1 \le 2) = F_{T_1}(3) - F_{T_1}(2) = .5 - 0 = .5 \\ P(T_1 = 4) &= P(T_1 \le 4) - P(T_1 \le 3) = F_{T_1}(4) - F_{T_1}(3) = .6875 - .5 = .1875 \end{aligned}$$


Fig. 6.8 The (incomplete) pmf of $T_1$ from Example 6.26


The first 12 probabilities in the pmf of $T_1$ are as follows (their sum is .9873):


k             1    2    3    4      5      6      7      8      9      10     11     12
$p_{T_1}(k)$  0    0    .5   .1875  .0625  .1094  .0390  .0352  .0234  .0137  .0103  .0063


This incomplete pmf is graphed in Fig. 6.8. Notice that $T_1$ must be at least
3, which is consistent with the state diagram in Fig. 6.7: it takes at least three steps
to get from state 1 to state 6 (one of $1 \to 2 \to 3 \to 6$, $1 \to 4 \to 3 \to 6$, or
$1 \to 4 \to 5 \to 6$).

A call to the bank determines that the documents are in the hands of the
underwriter indicated by state 4. So, let’s now consider the rv $T_4$ = time to absorption
(completion of the process) starting from state 4. Based on the state diagram, it
seems reasonable to anticipate that it will typically take less time to reach state
6 starting from state 4 than it did when the chain began in state 1. Reading off the
(4, 6) entries of $\mathbf{P}, \mathbf{P}^2, \ldots, \mathbf{P}^{12}$ yields the cdf values in the accompanying table;
subtraction as before then gives the corresponding pmf values.


k             1    2    3    4      5      6      7      8      9      10     11     12
$F_{T_4}(k)$  0    .75  .75  .8125  .9063  .9219  .9531  .9688  .9785  .9863  .9907  .9939
$p_{T_4}(k)$  0    .75  0    .0625  .0938  .0156  .0312  .0157  .0097  .0078  .0044  .0032


Notice that, starting in state 4, the chain is quite likely to be absorbed into
state 6 in exactly two steps (either $4 \to 5 \to 6$ or $4 \to 3 \to 6$, with probabilities .5
and .25, respectively), and that it is impossible to move from 4 to 6 in exactly
three steps. ■
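
The cdf and pmf tables of Example 6.26 can be reproduced with a short program. The following is a sketch under stated assumptions: the function names are our own, and the matrix P is entered on the assumption that it agrees with the sub-matrix Q of Example 6.28 plus the absorbing state 6 (states 1 to 6 are stored at list positions 0 to 5).

```python
def mat_mult(A, B):
    """Ordinary matrix product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def absorption_cdf(P, start, absorbing, kmax):
    """Return [F(1), ..., F(kmax)], where F(k) = P(T_start <= k) is the sum
    over absorbing states a of the (start, a) entry of P^k."""
    cdf, power = [], P
    for _ in range(kmax):
        cdf.append(sum(power[start][a] for a in absorbing))
        power = mat_mult(power, P)
    return cdf

# Bank loan chain (assumed entries; see the lead-in)
P = [[0, .5, 0, .5, 0, 0],
     [0, 0, .5, .5, 0, 0],
     [0, .5, 0, 0, 0, .5],
     [0, 0, .5, 0, .5, 0],
     [0, 0, 0, 0, 0, 1],
     [0, 0, 0, 0, 0, 1]]

cdf = absorption_cdf(P, start=0, absorbing=[5], kmax=12)
# pmf by differencing the cdf: p(1) = F(1), p(k) = F(k) - F(k-1)
pmf = [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]

for k, (F_k, p_k) in enumerate(zip(cdf, pmf), start=1):
    print(k, round(F_k, 4), round(p_k, 4))
```

Changing `start=0` to `start=3` gives the corresponding values for $T_4$. Small differences in the last decimal place relative to the tables are possible, because the tables difference already-rounded cdf values.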


Example 6.27 In the Gambler’s Ruin scenario with p = .55, how many games will
Allan and Beth play against each other before one player goes broke? Recall that
the transition matrix P is set from Allan’s perspective, and that he begins with $2.
Thus, the rv of interest is $T_2$, the number of transitions (aka games), starting from
Allan having $2, until the competition ends because Allan either has $0 or $3. The
one- and two-step transition matrices of this chain appear in Example 6.14. Hence


$$\begin{aligned} P(T_2 \le 1) &= P(2 \to 0) + P(2 \to 3) = 0 + .55 = .55 \\ P(T_2 \le 2) &= P^{(2)}(2 \to 0) + P^{(2)}(2 \to 3) = .2025 + .55 = .7525 \end{aligned}$$


In general, the cumulative probability $P(T_2 \le k)$ can be determined by adding the
(2,0) and (2,3) entries of the k-step transition matrix $\mathbf{P}^k$. These values were
determined with the aid of software for k = 1 through 10 and are summarized in
the accompanying table.


k             1    2      3      4      5      6      7      8      9      10
$F_{T_2}(k)$  .55  .7525  .8886  .9387  .9724  .9848  .9932  .9962  .9983  .9991
$p_{T_2}(k)$  .55  .2025  .1361  .0501  .0337  .0124  .0084  .0030  .0021  .0008


It’s important to notice that $T_2$ indicates the number of steps required to enter
some absorbing state (here, either $0 or $3), not the number of steps to enter a
particular such state. ■


6.5.2 Mean Time to Absorption


With $T_i$ = time to absorption starting from state i, the expected value of $T_i$ is called
the mean time to absorption (MTTA) from state i:


$\mu_i = E(T_i)$ = expected number of transitions until the Markov chain reaches an
absorbing state, starting in state i


For each of the preceding examples, the incomplete pmf can be used to approximate
the MTTA from state i. Consider the Markov chain in Example 6.26:


$$\begin{aligned} \mu_1 = E(T_1) &= \sum_{k=1}^{\infty} k \cdot p_{T_1}(k) \approx \sum_{k=1}^{12} k \cdot p_{T_1}(k) \\ &= 1(0) + 2(0) + 3(.5) + 4(.1875) + \cdots + 11(.0103) + 12(.0063) \\ &= 4.31 \end{aligned}$$


To a hopefully reasonable approximation, on average the chain requires 4.31
transitions, starting in state 1, to be absorbed into state 6. Similarly, the mean time
to absorption from state 4 is approximately

$$\begin{aligned} \mu_4 = \sum_{k=1}^{\infty} k \cdot p_{T_4}(k) &\approx \sum_{k=1}^{12} k \cdot p_{T_4}(k) \\ &= 1(0) + 2(.75) + \cdots + 11(.0044) + 12(.0032) = 2.91 \end{aligned}$$


For the Gambler’s Ruin competition with p = .55 and Allan’s initial stake at $2,
the pmf displayed in Example 6.27 gives


$$\mu_2 \approx 1(.55) + 2(.2025) + \cdots + 10(.0008) = 1.92$$


That is, if Allan starts with $2 and p = .55, the expected length of the Gambler’s
Ruin competition is approximated to be 1.92 games.

In all such approximations, two things should be clear. First, the estimated
means are smaller than the correct values, since the sums used are truncated
versions of the correct summations and every term is nonnegative. So, in the
Gambler’s Ruin scenario, $\mu_2 > 1.92$. Second, the more terms we include in the
truncated sum, the closer the approximation will be to the correct mean time to
absorption from that state. Of course, additional terms require overcoming the
practical hurdle of computing successively higher powers of the matrix P. With
software, one could in practice use this method to get a very good approximation to
the MTTA.
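
Both observations can be illustrated with the Gambler’s Ruin chain for p = .55. The sketch below is our own illustration: the function name `truncated_mtta` is hypothetical, and instead of forming matrix powers it propagates the distribution of the chain one step at a time, which yields the same cdf values.

```python
def truncated_mtta(P, start, absorbing, kmax):
    """Sum of k * P(T = k) for k = 1..kmax, a lower bound on the MTTA."""
    n = len(P)
    dist = [1.0 if i == start else 0.0 for i in range(n)]
    prev_cdf, total = 0.0, 0.0
    for k in range(1, kmax + 1):
        # distribution of the chain after k steps
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
        cdf = sum(dist[a] for a in absorbing)   # P(T <= k)
        total += k * (cdf - prev_cdf)           # k * P(T = k)
        prev_cdf = cdf
    return total

p = 0.55
# States $0, $1, $2, $3 from Allan's perspective; $0 and $3 are absorbing
P = [[1, 0, 0, 0],
     [1 - p, 0, p, 0],
     [0, 1 - p, 0, p],
     [0, 0, 0, 1]]

for kmax in (10, 20, 40):
    print(kmax, truncated_mtta(P, start=2, absorbing=[0, 3], kmax=kmax))
```

The printed values increase with `kmax`, and the first of them is close to the hand approximation 1.92 (it differs slightly because the table values were rounded before summing).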

Exercise 56 presents a different approximation method that always yields a 
better approximation to the mean time to absorption; moreover, it relies directly 
on the cdf values and thus does not require computing differences to form the pmf. 
But this is still an approximation; what we would really like is an explicit method 
for determining the exact mean time to absorption from various states in the chain. 
The following theorem provides such a result. 


MTTA THEOREM 

Suppose a finite-state Markov chain with one-step transition matrix P has
r non-absorbing states (and at least one absorbing state). Suppose further that
there exists a path from every non-absorbing state into some absorbing state.
Let Q be the r × r sub-matrix of P corresponding to the non-absorbing
states of the chain. Then the mean times to absorption from these states are
given by the matrix formula


$$\boldsymbol{\mu} = (\mathbf{I} - \mathbf{Q})^{-1}\mathbf{1},$$


where $\mu_i$ = MTTA from the ith state in the Q sub-matrix, $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_r)^T$,
I is the r × r identity matrix, and $\mathbf{1} = (1, \ldots, 1)^T$.


This theorem not only provides the exact mean times to absorption (as opposed
to the earlier approximations) but also computes all MTTAs simultaneously. A
proof of the MTTA Theorem will be presented shortly, but first we illustrate its use
with our two ongoing examples.


Example 6.28 For the bank loan Markov chain in Example 6.25, state 6 is the only
absorbing state, so there are r = 5 non-absorbing states. The sub-matrix
corresponding to these non-absorbing states is


$$\mathbf{Q} = \begin{bmatrix} 0 & .5 & 0 & .5 & 0 \\ 0 & 0 & .5 & .5 & 0 \\ 0 & .5 & 0 & 0 & 0 \\ 0 & 0 & .5 & 0 & .5 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$$


This can be obtained by “crossing out” the row and column of P corresponding
to absorbing state 6. Let $\mu_i = E(T_i)$ be the mean time to absorption from state i for
i = 1, 2, 3, 4, 5. Then, according to the MTTA Theorem,


$$\boldsymbol{\mu} = (\mathbf{I} - \mathbf{Q})^{-1}\mathbf{1} = \begin{bmatrix} 1 & -.5 & 0 & -.5 & 0 \\ 0 & 1 & -.5 & -.5 & 0 \\ 0 & -.5 & 1 & 0 & 0 \\ 0 & 0 & -.5 & 1 & -.5 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}^{-1}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 & .5 \\ 0 & 1.6 & 1.2 & .8 & .4 \\ 0 & .8 & 1.6 & .4 & .2 \\ 0 & .4 & .8 & 1.2 & .6 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 4.5 \\ 4.0 \\ 3.0 \\ 3.0 \\ 1.0 \end{bmatrix}$$


The inverse of I − Q was determined using software.

So, for example, the mean time to absorption from state 1 is $\mu_1 = 4.5$ transitions,
slightly larger than our earlier approximation of 4.31. On the average, it takes 4.5
steps to arrive at a loan decision starting from the time the loan documents are
initiated. Similarly, the earlier estimate $\mu_4 \approx 2.91$ was a little off from the correct
answer of $\mu_4 = 3$. The last entry of the vector $\boldsymbol{\mu}$ is obvious from the design of the
chain: since state 5 transitions immediately into state 6 with certainty, $T_5$ is
identically equal to 1, and so its mean is 1. ■
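
In practice one rarely needs the inverse itself: the MTTA vector solves the linear system $(\mathbf{I} - \mathbf{Q})\boldsymbol{\mu} = \mathbf{1}$. The following sketch, with hypothetical helper names `solve` and `mtta` and exact fraction arithmetic, illustrates this for the sub-matrix Q of Example 6.28. It is one possible implementation, not the software used in the example.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination using exact fractions."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, b)]
    for c in range(n):
        pr = next((r for r in range(c, n) if M[r][c] != 0), None)
        if pr is None:
            raise ValueError("I - Q is singular")
        M[c], M[pr] = M[pr], M[c]
        pivot = M[c][c]
        M[c] = [x / pivot for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                factor = M[r][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]

def mtta(Q):
    """Mean times to absorption: the solution mu of (I - Q) mu = 1."""
    r = len(Q)
    I_minus_Q = [[(1 if i == j else 0) - Q[i][j] for j in range(r)]
                 for i in range(r)]
    return solve(I_minus_Q, [1] * r)

h = Fraction(1, 2)
Q = [[0, h, 0, h, 0],
     [0, 0, h, h, 0],
     [0, h, 0, 0, 0],
     [0, 0, h, 0, h],
     [0, 0, 0, 0, 0]]

mu = mtta(Q)
print([float(m) for m in mu])   # [4.5, 4.0, 3.0, 3.0, 1.0]
```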


Example 6.29 Consider once again our Gambler’s Ruin scenario, this time with an
arbitrary probability p that Allan triumphs over Beth in any one game. The only two
non-absorbing states are $1 and $2, so the required sub-matrix Q consists of the
“center four” entries of the original 4 × 4 transition matrix:


$$\mathbf{P} = \begin{array}{c} 0 \\ 1 \\ 2 \\ 3 \end{array}\begin{bmatrix} 1 & 0 & 0 & 0 \\ 1-p & 0 & p & 0 \\ 0 & 1-p & 0 & p \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad\Rightarrow\quad \mathbf{Q} = \begin{bmatrix} 0 & p \\ 1-p & 0 \end{bmatrix}$$


There is a simple inverse formula for a 2 × 2 matrix:


$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \tag{6.7}$$


Applying Eq. (6.7) and the MTTA Theorem,


$$\mu_1 = \frac{1+p}{1-p+p^2} \qquad \mu_2 = \frac{2-p}{1-p+p^2}$$


Since we have always started Allan with $2, let’s explore $\mu_2$ further. If p = 1, so
Allan cannot lose, then $\mu_2 = (2 - 1)/(1 - 1 + 1^2) = 1$. This is logical, since Allan
would automatically transition from $2 to $3 in 1 step/game and the competition
would be over. Similarly, substituting p = 0 into this expression gives $\mu_2 = 2$,
reflecting the fact that if Allan cannot win games then the chain must necessarily
proceed along the path $2 \to 1 \to 0$, a total of two transitions. For p = .55, the
numerical case illustrated earlier, we have


$$\mu_2 = \frac{2 - .55}{1 - .55 + .55^2} = \frac{1.45}{.7525} = \frac{580}{301} = 1.92691$$


which is quite close to our previous approximation of 1.92.

For what value of p is the competition expected to take the longest? Using
calculus, one can find the maximum of $\mu_2$ with respect to p, which turns out to occur
at $p = 2 - \sqrt{3} \approx .268$. If Allan begins with $2 and has a .268 probability of winning
each game, the expected duration of the competition is maximized, specifically with
$\mu_2 = 1 + 2/\sqrt{3} \approx 2.155$ games. ■
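
The calculus step at the end of Example 6.29 can be written out as follows. This is a worked sketch of one route to the stated maximum, starting from the formula $\mu_2 = (2-p)/(1-p+p^2)$; it uses only the quotient rule.

```latex
\begin{aligned}
\frac{d\mu_2}{dp}
  &= \frac{-(1-p+p^2) - (2-p)(2p-1)}{(1-p+p^2)^2}
   = \frac{p^2 - 4p + 1}{(1-p+p^2)^2}, \\[4pt]
p^2 - 4p + 1 = 0
  &\;\Longrightarrow\; p = 2 \pm \sqrt{3},
   \quad\text{and only } p = 2 - \sqrt{3} \approx .268 \text{ lies in } [0, 1], \\[4pt]
\mu_2\big(2-\sqrt{3}\big)
  &= \frac{\sqrt{3}}{1 - (2-\sqrt{3}) + (2-\sqrt{3})^2}
   = \frac{\sqrt{3}}{3(2-\sqrt{3})}
   = \frac{\sqrt{3}\,(2+\sqrt{3})}{3}
   = 1 + \frac{2}{\sqrt{3}} \approx 2.155 .
\end{aligned}
```

The derivative is positive at p = 0 (where it equals 1) and negative at p = 1 (where it equals −2), so the critical point is indeed a maximum on [0, 1].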


Proof of the MTTA Theorem For notational ease, let 1, 2, ..., r be the
non-absorbing states of the chain. Also, let A denote the set of absorbing states
(which, if the Markov chain has s total states, could be enumerated as r + 1, ..., s).
Starting in any non-absorbing state i, consider the first transition of the chain. If the
chain transitions into any member of A, then it has been “absorbed” in one step and
so $T_i = 1$. On the other hand, if the chain transitions into any non-absorbing state
j (including back into i itself), then the expected number of steps to absorption is
$1 + E(T_j)$, where the 1 accounts for the step just taken and $T_j$ represents the time to
absorption starting from the new state j. Apply the Law of Total Expectation:


$$\begin{aligned} E(T_i) &= 1 \cdot P(i \to A) + \sum_{j=1}^{r} (1 + E(T_j)) \cdot P(i \to j) \\ &= P(i \to A) + \sum_{j=1}^{r} P(i \to j) + \sum_{j=1}^{r} E(T_j) \cdot P(i \to j) \end{aligned}$$


Since the state space of the Markov chain is $A \cup \{1, 2, \ldots, r\}$, the first two terms
in the expression above must sum to 1. Thus, we have $\mu_i = 1 + \sum_{j=1}^{r} \mu_j P(i \to j)$ for
i = 1, 2, ..., r. Stacking these equations and rewriting slightly, we have


$$\begin{aligned} \mu_1 &= P(1 \to 1)\mu_1 + \cdots + P(1 \to r)\mu_r + 1 \\ \mu_2 &= P(2 \to 1)\mu_1 + \cdots + P(2 \to r)\mu_r + 1 \\ &\ \ \vdots \\ \mu_r &= P(r \to 1)\mu_1 + \cdots + P(r \to r)\mu_r + 1 \end{aligned}$$


This stack can be written more compactly as $\boldsymbol{\mu} = \mathbf{Q}\boldsymbol{\mu} + \mathbf{1}$. Solving for $\boldsymbol{\mu}$ yields the
desired result. ■


The MTTA Theorem requires that every non-absorbing state can reach (at least) 
one absorbing state. That is, the set of absorbing states must be accessible from 
every non-absorbing state. What would happen if this were not the case? 


Example 6.30 In the Markov chain depicted in Fig. 6.9, 4 is an absorbing state, but
it is only accessible from state 3. It is clear that the chain will eventually be
absorbed into state 4 if $X_0 = 3$ and will never be absorbed into state 4 if $X_0 = 1$ or
2. So, where does the MTTA Theorem break down?

The one-step transition matrix P for this chain, the resulting sub-matrix Q for the
non-absorbing states, and the matrix I − Q required for calculating mean times to
absorption are


$$\mathbf{P} = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}\begin{bmatrix} .5 & .5 & 0 & 0 \\ .4 & .6 & 0 & 0 \\ 0 & 0 & .7 & .3 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad \mathbf{Q} = \begin{array}{c} 1 \\ 2 \\ 3 \end{array}\begin{bmatrix} .5 & .5 & 0 \\ .4 & .6 & 0 \\ 0 & 0 & .7 \end{bmatrix} \qquad \mathbf{I} - \mathbf{Q} = \begin{bmatrix} .5 & -.5 & 0 \\ -.4 & .4 & 0 \\ 0 & 0 & .3 \end{bmatrix}$$


The matrix I − Q is not invertible; this can be seen by noting that the first and
second rows are multiples of each other, or by computing the determinant and
discovering that det(I − Q) = 0. Because $(\mathbf{I} - \mathbf{Q})^{-1}$ does not exist, the formula from
the MTTA Theorem cannot be applied.


Fig. 6.9 State diagram for Example 6.30


Recall that the cdf of $T_1$ can be determined from the appropriate entries of the k-
step transition matrices; specifically, since the only absorbing state of this chain is
state 4,


$$F_{T_1}(k) = P(T_1 \le k) = P^{(k)}(1 \to 4) = \text{the } (1,4) \text{ entry of } \mathbf{P}^k$$


The (1, 4) entry of the matrix P above is 0, so $P(T_1 \le 1) = 0$. But since state 4 is
not accessible from state 1, the (1, 4) entry of every transition matrix $\mathbf{P}^k$ is 0. Thus,
$P(T_1 \le k) = 0$ for all positive integers k and $p_{T_1}(k) = 0 - 0 = 0$ for all k. Since the
probabilities associated with $T_1$ sum to zero and not 1, $T_1$ is not actually a valid rv
(and so, in particular, has no mean). ■


In general, when the set of absorbing states is not accessible from one or more 
non-absorbing states, the matrix I — Q will be singular (i.e., not invertible). If a subset 
of the non-absorbing states can access the absorbing states (that’s true for state 3 in 
Example 6.30), one can apply the MTTA Theorem if one defines Q to be the 
sub-matrix of P corresponding to those states that can access the absorbing states. 
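
Both remarks can be checked for the chain of Example 6.30. The sketch below is illustrative only (the helper `det3` is our own, and exact fractions are used so that the determinant is exactly zero rather than a rounding residue). It first confirms that I − Q is singular for the full set of non-absorbing states, and then applies the MTTA formula to the reduced sub-matrix containing only state 3.

```python
from fractions import Fraction as F

def det3(M):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Sub-matrix Q for non-absorbing states 1, 2, 3 of Example 6.30
Q = [[F(5, 10), F(5, 10), 0],
     [F(4, 10), F(6, 10), 0],
     [0, 0, F(7, 10)]]
I_minus_Q = [[(1 if r == c else 0) - Q[r][c] for c in range(3)] for r in range(3)]
print(det3(I_minus_Q))    # 0, so I - Q has no inverse

# Keep only state 3, the one non-absorbing state that can reach state 4.
# The reduced Q is the 1 x 1 matrix [.7], so mu_3 = (1 - .7)^(-1) * 1.
Q_reduced = F(7, 10)
mu_3 = 1 / (1 - Q_reduced)
print(mu_3)               # 10/3
```

The value 10/3 agrees with the mean of a geometric rv with success probability .3, which is what $T_3$ is in this chain.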


6.5.3 Mean First Passage Times 


We now briefly turn our attention back to regular Markov chains. In Sect. 6.4, we
saw that one interpretation of the probability $\pi_j$ from the Steady-State Theorem is
that $1/\pi_j$ represents the expected number of transitions necessary for the chain to
return to state j given that it starts there: the mean recurrence time for state j. With
a clever use of the MTTA Theorem, we can also determine the expected number of
transitions required for the chain to transition from a state i to a different state j,
called the mean first passage time from i to j.


Example 6.31 (Example 6.23 continued) For the ubiquitous taxi driver example, it
was found that the steady-state probability for zone 3 is $\pi_3 = .2$ and, thus, the
expected number of fares until the driver returns to zone 3 is $1/\pi_3 = 1/.2 = 5$.

But suppose the taxi driver just dropped off a fare in zone 1 (or zone 2). 
He wonders how long it will take him to get back home to zone 3 for lunch. 
More precisely, he wishes to know the expected number of fares required to reach 
zone 3, starting from some other state (i.e., different than zone 3). 

The trick to answering the taxi driver’s question (i.e., to determine the mean
first passage time for zone 3 when beginning in zone 1 or zone 2) is to pretend that
zone 3 is an absorbing state, and then invoke the MTTA Theorem. Modify the
original one-step transition matrix P of the Markov chain so that zone 3 is an absorbing
state, and label the new matrix $\tilde{\mathbf{P}}$:

$$
\tilde{\mathbf{P}} = \begin{bmatrix} .3 & .2 & .5 \\ .1 & .8 & .1 \\ 0 & 0 & 1 \end{bmatrix}
$$

Now proceed as before: the sub-matrix for the non-absorbing states, which in $\tilde{\mathbf{P}}$
are zone 1 and zone 2, is

$$
\mathbf{Q} = \begin{bmatrix} .3 & .2 \\ .1 & .8 \end{bmatrix}
$$

from which

$$
\boldsymbol{\mu} = (\mathbf{I} - \mathbf{Q})^{-1}\mathbf{1}
= \begin{bmatrix} .7 & -.2 \\ -.1 & .2 \end{bmatrix}^{-1} \begin{bmatrix} 1 \\ 1 \end{bmatrix}
= \begin{bmatrix} 3.33 \\ 6.67 \end{bmatrix}
$$

Thus, starting in zone 1, the average number of trips required for the taxi driver
to get home to zone 3 is 3.33, while it takes twice that long on the average if he’s
starting from zone 2. ■
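
The mean first passage calculation can be sketched in R for an arbitrary target state. The function name mean_first_passage is a hypothetical helper, not part of base R, and the matrix P below is the taxi-driver matrix as entered in the code of Fig. 6.10. Because only the rows and columns of the other states enter Q, the sketch never needs to overwrite the target's row explicitly.

```r
# Taxi-driver transition matrix (zones 1-3), as entered in Fig. 6.10
P <- matrix(c(.3, .2, .5,
              .1, .8, .1,
              .4, .4, .2), nrow = 3, byrow = TRUE)

# Hypothetical helper: treat `target` as absorbing and apply the MTTA formula
mean_first_passage <- function(P, target) {
  others <- setdiff(seq_len(nrow(P)), target)   # the states left non-absorbing
  Q <- P[others, others, drop = FALSE]          # sub-matrix among those states
  mu <- solve(diag(length(others)) - Q, rep(1, length(others)))
  names(mu) <- others                           # label each entry by its starting state
  mu
}

mfp_to_3 <- mean_first_passage(P, target = 3)
```

For target = 3 the result is approximately 3.33 from zone 1 and 6.67 from zone 2, in agreement with Example 6.31; changing target gives the mean first passage times into the other zones.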


6.5.4 Probabilities of Eventual Absorption 


As discussed in Example 6.29 in the context of Gambler’s Ruin, when a Markov 
chain has multiple absorbing states one can only speak of the mean time to 
absorption into the set of absorbing states, not any particular absorbing state (e.g., 
not time to $0 separate from time to $3). We can, however, ask about the probability 
of eventual absorption into state $0, as opposed to eventual absorption into state $3. 


DEFINITION
Let a be an absorbing state of a Markov chain and let i be a non-absorbing
state. The probability of eventual absorption into a from state i, denoted
$\pi(i \to a)$, is defined by

$$
\pi(i \to a) = \lim_{n \to \infty} P^{(n)}(i \to a)
$$


That is, $\pi(i \to a)$ is defined to be the limit of the (i, a) entry of $\mathbf{P}^n$ as $n \to \infty$. This
is consistent with our previous efforts to determine the probability of eventual
absorption by examining high powers of P, such as $\mathbf{P}^{75}$. But rather than approximate
these probabilities by taking a high power of P, we now present an explicit method for
determining them.

Before illustrating the method for determining $\pi(i \to a)$, a few observations are
in order. First, if state a is not accessible from state i, then $P^{(n)}(i \to a) = 0$ for all
n and the limit is also zero, i.e., $\pi(i \to a) = 0$ when i cannot access a. This occurred
in Example 6.30, with state 4 not being accessible from states 1 or 2.

Second, if the Markov chain has a single absorbing state a, then $\pi(i \to a) = 1$ for
every state i that can access a. That is, a chain with an accessible absorbing state
will always eventually be absorbed. This would be the case, for instance, in
Example 6.25: it is a sure thing that the chain will eventually arrive at (and stay
in) state 6, irrespective of where the chain begins. So, the interesting cases of
determining $\pi(i \to a)$ are for Markov chains with multiple absorbing states, such as
Gambler’s Ruin.

Third, suppose we extended the preceding definition to non-absorbing states.
That is, what can be said about

$$
\lim_{n \to \infty} P^{(n)}(i \to j)
$$

when j is not an absorbing state? If the Markov chain has any absorbing states (and
assuming at least one of these is accessible from i), then the chain will eventually
get absorbed and so $P^{(n)}(i \to j) \to 0$. If we have a regular Markov chain (which, in
particular, means there are no absorbing states), then the Steady-State Theorem
tells us $P^{(n)}(i \to j) \to \pi_j$, a steady-state probability that is independent of i. For other
cases, such as the cyclic chain of Example 6.20, the limit of $P^{(n)}(i \to j)$ may not
exist at all.

On to the calculation: as in the proof of the MTTA Theorem, rearrange the states 
so that the non-absorbing states of the Markov chain are 1, 2, ..., r and the 
absorbing states are r+1, ..., s. Then the one-step transition matrix P can be 
partitioned as follows: 


$$
\mathbf{P} \;=\;
\begin{array}{c} 1, \ldots, r \\ r+1, \ldots, s \end{array}
\left[\begin{array}{c|c}
\mathbf{Q} & \mathbf{R} \\ \hline
\mathbf{O} & \mathbf{I}
\end{array}\right]
\tag{6.8}
$$

Expression (6.8) is sometimes called the canonical form of a Markov chain. In
Eq. (6.8), Q is the r × r sub-matrix for the non-absorbing states, as before. The
matrix O in the lower left of Eq. (6.8) consists entirely of zeros, since that quadrant
of P indicates the probabilities of transitioning from an absorbing state (r+1, ..., s)
to a non-absorbing state (1, ..., r). Similarly, I is the (s − r) × (s − r) identity
matrix, since its diagonal entries correspond to $P(a \to a)$ for the absorbing states
and its off-diagonal entries to impossible events (transitions from one absorbing
state to another). The “remainder” matrix R indicates the transition probabilities
from the non-absorbing states into the absorbing states and can have (fairly)
arbitrary entries.
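
To make the partition concrete, the following R sketch reorders the states of the Gambler's Ruin chain with a $3 total stake and extracts the four blocks of the canonical form. It is only an illustration: the value p = .55 is the one used in this chapter's examples, the object names (P_canon, I_abs, and so on) are arbitrary, and absorbing states are detected by the simple test that the diagonal entry equals 1.

```r
# Gambler's Ruin with a $3 total stake; p = chance Allan wins a single game
p <- 0.55
states <- c("$0", "$1", "$2", "$3")
P <- matrix(c(1,     0,     0, 0,
              1 - p, 0,     p, 0,
              0,     1 - p, 0, p,
              0,     0,     0, 1),
            nrow = 4, byrow = TRUE, dimnames = list(states, states))

absorbing <- diag(P) == 1                        # absorbing states have P(a -> a) = 1
ord <- c(which(!absorbing), which(absorbing))    # non-absorbing states listed first
P_canon <- P[ord, ord]                           # canonical form, as in Eq. (6.8)

r <- sum(!absorbing)                             # number of non-absorbing states
s <- nrow(P)                                     # total number of states
Q     <- P_canon[1:r, 1:r, drop = FALSE]
R     <- P_canon[1:r, (r + 1):s, drop = FALSE]
O     <- P_canon[(r + 1):s, 1:r, drop = FALSE]
I_abs <- P_canon[(r + 1):s, (r + 1):s, drop = FALSE]
```

After the reordering, the rows of Q and R are labeled $1 and $2, O contains only zeros, and I_abs is the 2 × 2 identity matrix.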

The probabilities of eventual absorption into every absorbing state from every
non-absorbing state are provided by the following theorem.

THEOREM
Consider a Markov chain with non-absorbing states 1, ..., r and absorbing
states r+1, ..., s. Define sub-matrices Q and R of the one-step transition
matrix P as in Eq. (6.8). Suppose further that every absorbing state is
accessible from every non-absorbing state. Then the probabilities of eventual
absorption are given by

$$
\boldsymbol{\Pi} = (\mathbf{I} - \mathbf{Q})^{-1}\mathbf{R},
$$

where I is the r × r identity matrix and $\boldsymbol{\Pi}$ is an r × (s − r) matrix whose entries
are the probabilities $\pi(i \to a)$ for i = 1, ..., r and a = r+1, ..., s.


Some guidance for the proof of this theorem can be found in Exercise 57. 


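Example 6.32 (Example 6.29 continued) To apply the previous theorem to our
Gambler’s Ruin example, we need to reorder the states, so that non-absorbing states
$1 and $2 come first while absorbing states $0 and $3 come last. The canonical form
of P (rows and columns in the order \$1, \$2, \$0, \$3), along with the relevant
sub-matrices Q and R, are

$$
\mathbf{P} = \left[\begin{array}{cc|cc}
0 & p & 1-p & 0 \\
1-p & 0 & 0 & p \\ \hline
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{array}\right]
\qquad
\mathbf{Q} = \begin{bmatrix} 0 & p \\ 1-p & 0 \end{bmatrix}
\qquad
\mathbf{R} = \begin{bmatrix} 1-p & 0 \\ 0 & p \end{bmatrix}
$$

Applying the previous theorem, along with the inverse formula (6.7) for a 2 × 2
matrix,

$$
\boldsymbol{\Pi} = (\mathbf{I} - \mathbf{Q})^{-1}\mathbf{R}
= \begin{bmatrix} 1 & -p \\ -1+p & 1 \end{bmatrix}^{-1}
  \begin{bmatrix} 1-p & 0 \\ 0 & p \end{bmatrix}
= \frac{1}{1-p+p^2}
  \begin{bmatrix} 1 & p \\ 1-p & 1 \end{bmatrix}
  \begin{bmatrix} 1-p & 0 \\ 0 & p \end{bmatrix}
= \frac{1}{1-p+p^2}
  \begin{bmatrix} 1-p & p^2 \\ 1-2p+p^2 & p \end{bmatrix}
$$

Reading off the entries of the matrix $\boldsymbol{\Pi}$, we have

$$
\pi(\$1 \to \$0) = \frac{1-p}{1-p+p^2} \qquad \pi(\$1 \to \$3) = \frac{p^2}{1-p+p^2}
$$

$$
\pi(\$2 \to \$0) = \frac{1-2p+p^2}{1-p+p^2} \qquad \pi(\$2 \to \$3) = \frac{p}{1-p+p^2}
$$

In particular, if Allan starts with $2, the probability he will eventually win the
competition is $\pi(\$2 \to \$3) = p/(1 - p + p^2)$. As a check, this probability equals zero
when p = 0 (Allan never wins games) and equals one when p = 1 (Allan always
wins games). If p = .55, as in several of the previous examples in this chapter,

$$
\pi(\$2 \to \$3) = \frac{.55}{1 - .55 + .55^2} = \frac{.55}{.7525} = \frac{220}{301} \approx .7309
$$

Notice that this is, to four decimal places, the probability we approximated by
computing $\mathbf{P}^{75}$ with software and thereby obtaining $P^{(75)}(\$2 \to \$3)$. ■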


The matrices R and $\boldsymbol{\Pi}$ in Example 6.32 are square, but this is not necessarily the
case in other scenarios. In general, Q is an r × r matrix (hence, square), but the
dimensions of both R and $\boldsymbol{\Pi}$ are r × (s − r).
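
The absorption-probability formula and the high-power approximation can be compared directly in software. The R sketch below assumes p = .55 and the state order $1, $2, $0, $3 of Example 6.32; the object names Pi, Pn, and approx_win are illustrative only.

```r
p <- 0.55
Q <- matrix(c(0,     p,
              1 - p, 0), nrow = 2, byrow = TRUE,
            dimnames = list(c("$1", "$2"), c("$1", "$2")))
R <- matrix(c(1 - p, 0,
              0,     p), nrow = 2, byrow = TRUE,
            dimnames = list(c("$1", "$2"), c("$0", "$3")))

# Exact absorption probabilities: solve (I - Q) Pi = R rather than inverting
Pi <- solve(diag(2) - Q, R)
dim(Pi)                          # r x (s - r); here 2 x 2

# Approximation: rebuild P in canonical order and form the 75th power
P  <- rbind(cbind(Q, R), cbind(matrix(0, 2, 2), diag(2)))
Pn <- diag(4)
for (n in 1:75) Pn <- Pn %*% P   # after the loop, Pn holds P^75
approx_win <- Pn[2, 4]           # 75-step probability of going from $2 to $3
```

The entry Pi[2, 2] is the exact value 220/301 ≈ .7309, and approx_win agrees with it to many decimal places because the probability of still being unabsorbed after 75 games is negligible. Each row of Pi sums to 1, since absorption somewhere is certain.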


6.5.5 Exercises: Section 6.5 (44-58) 


44. Explain why a Markov chain with one or more absorbing states cannot be a 
regular chain. 

45. A local community college offers a three-semester athletics training 
(AT) program. Suppose that at the end of each semester, 75% of students 
successfully move on to the next semester (or to graduation from the third 
semester) and 25% are required to repeat the most recent semester. 

(a) Construct a transition matrix to represent this scenario. The four states are 
(1) first semester, (2) second semester, (3) third semester, (4) graduate. 

(b) What is the probability a student graduates the program within three 
semesters? Four semesters? Five semesters? 

(c) What is the average number of semesters required to graduate from this AT 
program? 

(d) According to this model, what is the probability of eventual graduation? 
Does that seem realistic? 

46. Refer back to the previous exercise. Now suppose that at the end of each 
semester, 75% of students successfully move on to the next semester (or to 
graduation from the third semester), 15% flunk out of the program, and 10% 
repeat the most recent semester. 

(a) Construct a transition matrix to represent this updated situation by adding a 
fifth state, (5) flunk out. [Hint: Two of the five states are absorbing. ] 

(b) What is the probability a student exits the program, either by graduating or 
flunking out, within three semesters? Four semesters? Five semesters? 

(c) What is the average number of semesters students spend in this program 
before exiting (again, either by graduating or flunking out)? 

(d) What proportion of students that enter the program eventually graduate?
What proportion eventually flunk out?

(e) Given that a student has passed the first two semesters (and, so, is currently
in her third-semester courses), what is the probability she will eventually
graduate?

47. The article “Utilization of Two Web-Based Continuing Education Courses
Evaluated by Markov Chain Model” (J. Am. Med. Inform. Assoc. 2012:
489-494) compared students’ flow between pages of an online course for two
different Web layouts in two different health professions classes. In the first
week of the classes, students could visit (1) the homepage, (2) the syllabus,
(3) the introduction, and (4) chapter 1 of the course content. Each student was
tracked until s/he either reached chapter 1 or exited without reading chapter
1 (call the latter state 5). For one version of the Web content in one class, the
following transition matrix was estimated:

$$
\mathbf{P} = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 \\
.21 & 0 & .33 & .05 & .41 \\
.09 & .15 & 0 & .67 & .09 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}
$$

When students log into the course, they are always forced to begin on the
homepage.

(a) Identify the absorbing state(s) of this chain.

(b) Let $T_1$ = the number of transitions students take, starting from the
homepage, until they either arrive at chapter 1 or exit early. Determine
$P(T_1 \le k)$ for k = 1, 2, ..., 10.

(c) Use (b) to approximate the pmf of $T_1$, and then approximate the mean time
to absorption starting from the class homepage.

(d) Determine the (true) mean time to absorption starting from the homepage.

(e) What proportion of students eventually got to chapter 1 in the first week?
What proportion exited the course without visiting chapter 1?

48. Refer back to the previous exercise. After some content redesign, the same
Web-based health professions course was run a second time. The first-week
transition probabilities for the revised course were as follows:

$$
\mathbf{P} = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 \\
.15 & 0 & .43 & .06 & .36 \\
.09 & .16 & 0 & .66 & .09 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}
$$

(a) How did the redesign affect the average amount of time students spent in
the course (at least as measured by the number of Web page visits within a
session)?

(b) Did the redesign improve the chances that students would get to the chapter
1 content before exiting the system?

49. In Exercise 4, we introduced a game in which Michelle will flip a coin until she
gets heads four times in a row. Define $X_0 = 0$ and, for $n \ge 1$, $X_n$ = the number of
heads in the current streak of heads after the nth flip.

(a) Construct the one-step transition matrix P for this chain, on the state space
{0, 1, 2, 3, 4}. What is special about state 4?

(b) Let $T_0$ denote the total number of coin flips required by Michelle to achieve
four heads in a row. Construct the cdf of $T_0$, $P(T_0 \le k)$, for k = 1, 2, ..., 15.
[Hint: The cdf values for k = 1, 2, 3 should be obvious.]

(c) Michelle will win a prize if she can get four heads in a row within 10 coin
flips. What is the probability she wins the prize?

(d) Use (b) to construct an incomplete pmf of $T_0$. Then use this incomplete pmf
to approximate both the mean and standard deviation of $T_0$.

(e) What is the (exact) expected number of coin flips required for Michelle to
get four heads in a row?

50. Refer back to Exercise 8. The article referenced in that exercise provides the
following transition matrix for the states (1) current, (2) delinquent, (3) loss,
and (4) paid, for a certain class of loans:

$$
\mathbf{P} = \begin{bmatrix}
.95 & .04 & 0 & .01 \\
.15 & .75 & .07 & .03 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$

(a) Identify the absorbing state(s).

(b) Determine the mean time to absorption for a loan customer who is current
on payments, and for a customer who is delinquent.

(c) If a loan customer is current on payments, what is the probability s/he will
eventually pay off the loan? What is the probability the loan company will
suffer a loss on this account?

(d) Answer (c) for customers who are delinquent on their loans.

51. Refer back to Exercise 15(c). Calculate and interpret the mean times to
absorption for this chain. For which opening strategy, cooperative or competitive,
is the negotiation process longer on the average?

52. Refer back to Exercise 14. Assuming Lucas begins searching for his uncle in
room 1 and his uncle is hiding in room 6, what is the expected number of rooms
Lucas will visit in order to “win” this round of hide-and-seek?

53. Modify the Gambler’s Ruin example of this section to a $4 total stake. That is,
Allan may start with $x_0 = \$1, \$2,$ or $\$3$, and Beth has $\$(4 - x_0)$ initially. As
usual, let p denote the probability Allan wins any single game.

(a) Construct the one-step transition matrix.

(b) Determine the mean times to absorption for each of Allan’s possible
starting values, as functions of p.

(c) Determine the probability Allan eventually wins, starting with $1 or $2 or
$3, as functions of p.

54. Refer back to the Ehrenfest chain model introduced in Exercise 2. Suppose
there are m = 3 balls being exchanged between the two chambers. If the left
chamber is currently empty, what is the expected number of exchanges until it
is full (i.e., all 3 balls are on the left side)?

55. Refer back to Exercise 40. If a man has a low-prestige occupation, what is the
expected number of generations required for him to have an offspring with a
high-prestige occupation?

56. Exercise 48 of Chap. 2 established the following formula for the mean of a rv
X whose possible values are positive integers:

$$
E(X) = 1 + \sum_{x=1}^{\infty} [1 - F(x)],
$$

where F(x) is the cdf of X. Hence, if the values F(1), F(2), ..., F(x*) are known for
some integer x*, the mean of X can be approximated by $1 + \sum_{x=1}^{x^*} [1 - F(x)]$.

(a) Refer back to Example 6.26. Use the given cdf values and the above
expression with x* = 12 to approximate $E(T_1)$, the mean time to absorption
starting in state 1. How does this compare to the pmf-based approximation
in the example? How does it compare to the exact answer, 4.5?

(b) Repeat part (a), starting in state 4 of the bank loan Markov chain.

(c) Will this method always under-approximate the true mean of the rv, or can
you tell? Explain.

[Note: It can be shown that this “cdf method” of approximating the mean will
always produce a higher value than the truncated sum of x · p(x).]

57. This exercise outlines a proof of the formula $\boldsymbol{\Pi} = (\mathbf{I} - \mathbf{Q})^{-1}\mathbf{R}$ for the
probabilities of eventual absorption. You will want to refer back to Eq. (6.8),
as well as the proof of the MTTA Theorem.

(a) Starting in a non-absorbing state i, the chain will eventually be absorbed
into absorbing state a if either (1) the chain transitions immediately into a,
or (2) the chain transitions into any non-absorbing state and then eventually
is absorbed into state a. Use this to explain why

$$
\pi(i \to a) = P(i \to a) + \sum_{j \in A'} P(i \to j)\,\pi(j \to a),
$$

where A′ denotes the set of non-absorbing states of the chain.

(b) The equation in (a) holds for all $i \in A'$ and all $a \in A$. Show that this
collection of equations can be rewritten in matrix form as $\boldsymbol{\Pi} = \mathbf{R} + \mathbf{Q}\boldsymbol{\Pi}$, and
then solve for $\boldsymbol{\Pi}$. (You may assume the matrix I − Q is invertible.)

58. The matrix $(\mathbf{I} - \mathbf{Q})^{-1}$ arises in several contexts in this section. This exercise
provides an interpretation of its entries. Consider a Markov chain with at least
one absorbing state, and assume that every non-absorbing state can access at
least one absorbing state. As before, A and A′ will denote the sets of absorbing
and non-absorbing states, respectively.

(a) Consider any two non-absorbing states i and j. Let $\mu_{ij}$ denote the expected
number of visits to state j, starting in state i, before the chain is absorbed.
(When j = i, the initial state is counted as one visit.) Mimicking the proof of
the MTTA Theorem, show that

$$
\mu_{ii} = \sum_{a \in A} 1 \cdot P(i \to a) + \sum_{k \in A'} (1 + \mu_{ki}) \cdot P(i \to k)
= 1 + \sum_{k \in A'} \mu_{ki}\, P(i \to k)
$$

(b) Using similar reasoning, show that for $i \ne j$,

$$
\mu_{ij} = \sum_{k \in A'} \mu_{kj}\, P(i \to k)
$$

(c) Let M be the square matrix whose (i, j)th entry is $\mu_{ij}$. Combine (a) and (b) to
establish the equation M = I + QM, and solve for M.


6.6 Simulation of Markov chains 


A typical Markov chain simulation requires two elements: the one-step transition
matrix, P, and an indication of the initial state $X_0$ (either as a fixed state value or as a
rv with a probability distribution). The actual simulation of a single realization of
the chain $X_0, X_1, X_2, \ldots$ then amounts to repeated selections from the transitional
probability distributions specified by elements of P. Simulations of Markov chains
allow us to confirm theoretical results and, more importantly, determine properties
of Markov chains that are not covered by the theorems of this chapter or other
theoretical results.

The main step in any Markov chain simulation is to simulate a value for the next
step, $X_{n+1}$, based on the transition probabilities coming out of the current step $X_n$.
Let’s start with the initial state $X_0$. Suppose for one particular run of the simulation,
$X_0$ has been assigned the state i, either because that’s the fixed initial state or
because a single draw from some initial distribution $\mathbf{v}_0$ yielded i. Conditional on
$X_0 = i$, the distribution of $X_1$ is determined by the transition probabilities $P(i \to j)$
for j = 1, 2, 3, ..., which appear in the ith row of P. Thus, one needs to extract the
ith row of P and use it as the basis for a single discrete simulation. If the result of
this simulation is $X_1 = j$, then the jth row of P can be accessed to simulate $X_2$, and
so on.
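
To see what a "single discrete simulation" from a row of P involves, here is a minimal R sketch that draws the next state by hand from a uniform random number, using the cumulative sums of the current row. The function name step_chain is a hypothetical helper (the figures in this section rely on the built-in samplers instead), and the matrix is the taxi-driver matrix used in this section's examples. The argument u is exposed only so that the behavior can be checked with fixed values.

```r
# Hypothetical helper: one transition drawn from row `current` of P
step_chain <- function(P, current, u = runif(1)) {
  cum <- cumsum(P[current, ])     # running totals of the current row
  cum[length(cum)] <- 1           # guard against rounding in the final total
  min(which(u <= cum))            # first state whose running total reaches u
}

P <- matrix(c(.3, .2, .5,
              .1, .8, .1,
              .4, .4, .2), nrow = 3, byrow = TRUE)

step_chain(P, current = 1, u = 0.10)   # u in [0, .3]   gives state 1
step_chain(P, current = 1, u = 0.45)   # u in (.3, .5]  gives state 2
step_chain(P, current = 1, u = 0.90)   # u in (.5, 1]   gives state 3
```

Calling step_chain(P, current) without u draws a fresh uniform value each time, so feeding each result back in as the next value of current produces one realization of the chain.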


Example 6.33 Let’s simulate a typical day in the life of our taxi driver. Although a
real taxi driver does not have the same number of fares each day, for purposes of
this first simulation we’ll assume that he takes exactly 25 fares in 1 day.

Suppose first that the driver begins each day in a random zone $X_0$, as in Example
6.15, specifically with the initial distribution p(1) = .2, p(2) = .5, p(3) = .3. We
begin by simulating a single value from this initial distribution. Once that is
determined, our program should then simulate a single value of $X_1$ using the row
of P corresponding to the value of $X_0$, then do the same for $X_2$ based on the
simulated value of $X_1$, and so on. Figure 6.10 shows Matlab and R code for such
a simulation.


(a) Matlab

    P=[.3 .2 .5; .1 .8 .1; .4 .4 .2];
    states=[1 2 3];
    v0=[.2 .5 .3];
    X=randsample(states,1,true,v0);      % random initial state X0
    current=X;
    for i=1:25
        nextstate=randsample(states,1,true,P(current,:));
        X=[X nextstate];                 % append the new state to the path
        current=nextstate;
    end

(b) R

    P <- matrix(c(.3,.2,.5,.1,.8,.1,
        .4,.4,.2),nrow=3,ncol=3,byrow=TRUE)
    states <- c(1,2,3)
    v0 <- c(.2,.5,.3)
    X <- sample(states,1,TRUE,v0)        # random initial state X0
    current <- X
    for (i in 1:25) {
        nextstate <- sample(states,1,TRUE,P[current,])
        X <- c(X,nextstate)              # append the new state to the path
        current <- nextstate
    }

Fig. 6.10 Code for Example 6.33: (a) Matlab; (b) R


In Matlab, P(current,:) calls for the row of P specified by the numerical
index current; the code P[current,] performs the same task in R. The
output of both of these programs is a vector, X, containing the sequence of states
for the Markov chain (beginning with $X_0$). For example, one run of the above
program in R yielded the following output:

    > X
    [1] 2 1 3 1 2 2 2 2 1 1 1 1 3 1 3 3 1 3 2 2 2 2 3 3 1 3

The randomly selected initial state was $X_0 = 2$, followed by $X_1 = 1$, $X_2 = 3$, ...,
and finally $X_{25} = 3$. (The symbol [1] at the left is not the initial state; this is just R’s
way of denoting the beginning of X.) If we weren’t interested in the initial state
of the chain, the code could easily be modified not to store $X_0$, in which case
the indices of the output vector would match the time indices of the Markov chain
(i.e., the subscripts on $X_1, X_2, \ldots, X_{25}$).

To make $X_0$ a fixed initial state instead of a true random variable, one need
simply replace the two lines of code specifying the initial probability vector and the
first random selection. In the Matlab code, the third and fourth lines could be
replaced by the statement X=3; to fix the taxi driver’s initial state as zone 3. A
similar comment applies to the R code. And, again, one could choose whether or not
to store the initial state as part of the output vector. ■


It is important not to confuse the number of transitions of the chain with the
number of runs of the simulation. In Example 6.33, both programs simulate the
chain through 25 transitions, but only a single run. If it’s our desire to keep track of
the chain’s behavior across many different runs, analogous to the simulations
described at the ends of Chaps. 1-4, then we must add an additional layer of
code, typically in the form of a surrounding “for” or “while” loop.

Example 6.34 As an illustration of the Steady-State Theorem, consider the model
for Web users’ browser histories discussed in Example 6.19 (refer back to that
example to see the one-step transition matrix). Let’s simulate the distribution of
$X_{75}$, the Web site category of a user’s 75th visited page. For variety’s sake, suppose
users are equally likely to start surfing the Web in any one of the five Web site
categories; recall that the initial distribution of a regular chain will not affect its
long-run behavior. The programs displayed in Fig. 6.11 perform 10,000 runs of
simulating this Markov chain up through $X_{75}$. Purely to save space, the code to
create P has been suppressed in Fig. 6.11, but it is very similar to what appears in
Fig. 6.10.

In the fourth line of code, we have employed a shortcut version of the
randsample and sample functions in Matlab and R, respectively, to randomly
and uniformly select a single random integer from 1 to 5 (this is the initial state).
Both programs store the state of the Markov chain after 75 transitions in the vector
named X75 for each of 10,000 runs. (Notice that the intermediate states $X_1, \ldots, X_{74}$
are not permanently stored.)

The 10,000 simulated values of $X_{75}$ from one execution of the Matlab program
are summarized in the accompanying table, along with the steady-state probabilities
for this chain determined in Sect. 6.4.

    j                          1       2       3       4       5
    # of times              2822    1004    1599    3816     759
    estimated P(X75 = j)   .2822   .1004   .1599   .3816   .0759
    steady-state pi_j      .2840   .0948   .1659   .3791   .0758


The estimated and theoretical steady-state probabilities are quite similar.
Remember that these two rows of probabilities should differ slightly for two
reasons: first, this is only a simulation of 10,000 values of the rv $X_{75}$, so there is
natural simulation error; second, the steady-state probabilities indicate the behavior
of $X_n$ as $n \to \infty$, and we don’t expect the rv $X_{75}$ to have exactly this distribution
(although it should be close).
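
One way to judge whether the differences in the table are consistent with "natural simulation error" is to compare each difference with a binomial standard error. The R sketch below does this using only the numbers in the table; treating the steady-state probabilities as the true values of $P(X_{75} = j)$ is an approximation made for illustration, and the object names are arbitrary.

```r
counts <- c(2822, 1004, 1599, 3816, 759)        # simulated frequencies of X75
pi_ss  <- c(.2840, .0948, .1659, .3791, .0758)  # steady-state probabilities
n_runs <- sum(counts)                           # number of simulation runs

p_hat <- counts / n_runs                        # estimated probabilities
se    <- sqrt(pi_ss * (1 - pi_ss) / n_runs)     # binomial standard error per state
z     <- (p_hat - pi_ss) / se                   # differences in standard-error units

round(cbind(p_hat, pi_ss, se, z), 4)
```

For these data every difference is smaller than two standard errors in absolute value, which is the kind of discrepancy one expects from 10,000 runs.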


(a) Matlab

    P=not shown;
    X75=zeros(10000,1);
    for i=1:10000
        current=randsample(5,1);          % uniform random initial state
        for j=1:75
            nextstate=randsample(1:5,1,true,P(current,:));
            current=nextstate;
        end
        X75(i)=current;                   % state after 75 transitions
    end

(b) R

    P <- not shown
    X75 <- NULL
    for (i in 1:10000) {
        current <- sample(5,1)            # uniform random initial state
        for (j in 1:75) {
            nextstate <- sample(1:5,1,TRUE,P[current,])
            current <- nextstate
        }
        X75[i] <- current                 # state after 75 transitions
    }

Fig. 6.11 Code for Example 6.34: (a) Matlab; (b) R ■

Section 6.5 introduced the notions of time to absorption and mean time to
absorption for Markov chains with one or more absorbing states. We can also use
simulation to explore properties of time-to-absorption variables and first-passage
times.


Example 6.35 Consider again the bank loan application process described in
Example 6.25, with lone absorbing state 6 (ultimate acceptance or rejection of the
application), and the random variable $T_1$ = time to absorption from state 1 (document
initiation). To simulate the distribution of $T_1$, one begins the chain in state
1 and continues to simulate its transitions until it arrives in state 6. The simulation
program now must keep track of how many transitions occur, rather than just where
the chain ends up. Figure 6.12 shows Matlab and R code for this purpose; again, to
save space, the code for entering the matrix P is not included.


(a) Matlab

    P=not shown;
    T1=zeros(10000,1);
    for i=1:10000
        current=1;
        steps=0;
        while current~=6                  % run until absorbed in state 6
            steps=steps+1;
            nextstate=randsample(1:6,1,true,P(current,:));
            current=nextstate;
        end
        T1(i)=steps;                      % time to absorption for run i
    end

(b) R

    P <- not shown
    T1 <- NULL
    for (i in 1:10000) {
        current <- 1
        steps <- 0
        while (current!=6) {              # run until absorbed in state 6
            steps <- steps+1
            nextstate <- sample(1:6,1,TRUE,P[current,])
            current <- nextstate
        }
        T1[i] <- steps                    # time to absorption for run i
    }

Fig. 6.12 Code for Example 6.35: (a) Matlab; (b) R


The simulated distribution of $T_1$ from one execution of the R code in Fig. 6.12b
appears in Fig. 6.13. These particular 10,000 simulated values had a sample mean
of 4.508 and a sample standard deviation of 2.276. Notice the sample mean is quite
close to the theoretical expectation, $E(T_1) = 4.5$, determined in Sect. 6.5.

Clearly, the sample mean of the simulated $T_1$ values is a better estimate of $E(T_1)$
than the approach utilizing the truncated pmf presented in the previous section.
Of course, neither is strictly necessary since the mean of $T_1$ can be found explicitly
using the MTTA Theorem. The new information provided by the simulation is
a measure of the variability of $T_1$: we estimate the standard deviation of $T_1$ to
be 2.276, whereas no simple matrix formula exists for its theoretical standard
deviation. ■


The preceding examples employed simulations primarily to confirm theoretical
results established in earlier sections. (Or, perhaps better put, our earlier theoretical
results validate the simulations!) The final two examples of this section indicate
situations where we must rely on simulation methods to approximate values of
desired quantities.

Fig. 6.13 Simulation distribution of $T_1$ in Example 6.35


Example 6.36 Refer back to Example 6.13, which described Chinese cell phone 
users’ transitions between three major carriers. Suppose users may renew or change 
contracts annually, and that annual plans for the three carriers cost the following 
(in $US): 550 for China Telecom, 600 for China Unicom, and 525 for China 
Mobile. Assume that, because of governmental regulations, these prices will remain 
the same for the next 10 years. If last year the market shares of the three carriers 
were .4, .2, and .4, respectively and all contracts are about to come up for renewal, 
what is the average amount a Chinese cell phone customer will pay over the next 
decade? 

We will employ a Markov chain simulation to model the behavior of customers’ 
carrier choices for 10 consecutive years. Critically, we must keep track of how 
much money a customer spends each year—that is, our three states now have 
associated quantitative values. (This is a common instance where simulation proves 
useful.) Let Y= the total cost of ten 1-year calling plans for a Chinese cell phone 
customer. Figure 6.14 shows code for simulating Y using the techniques of this 
section. 

An initial state x0 is first determined using the specified initial probability
distribution (here, $\mathbf{v}_0$ = [.4 .2 .4]). Then, ten steps of the Markov chain are
simulated; each of these states is temporarily held in nextstate. The vector
AnnualCost stores the cost of a 1-year calling plan by calling the appropriate
element of the Prices vector. Once a 10-year chain has been simulated, those
annual costs are summed and stored as a simulated value of Y.

(a) Matlab

    P=[.84 .06 .1;.08 .82 .1;.1 .04 .86];
    Prices=[550 600 525];
    Y=zeros(10000,1);
    for i=1:10000
        v0=[.4 .2 .4];
        AnnualCost=zeros(10,1);
        x0=randsample(1:3,1,true,v0);     % last year's carrier
        current=x0;
        for n=1:10
            nextstate=randsample(1:3,1,true,P(current,:));
            AnnualCost(n)=Prices(nextstate);
            current=nextstate;
        end
        Y(i)=sum(AnnualCost);             % 10-year total for run i
    end

(b) R

    P <- matrix(c(.84,.06,.1,.08,.82,.1,
        .1,.04,.86),nrow=3,ncol=3,byrow=TRUE)
    Prices <- c(550,600,525)
    Y <- NULL
    for (i in 1:10000) {
        v0 <- c(.4,.2,.4)
        AnnualCost <- NULL
        x0 <- sample(1:3,1,TRUE,v0)       # last year's carrier
        current <- x0
        for (n in 1:10) {
            nextstate <- sample(1:3,1,TRUE,P[current,])
            AnnualCost[n] <- Prices[nextstate]
            current <- nextstate
        }
        Y[i] <- sum(AnnualCost)           # 10-year total for run i
    }

Fig. 6.14 Code for Example 6.36: (a) Matlab; (b) R


A histogram of the 10,000 simulated Y values appears in Fig. 6.15. Notice that
the distribution of Y has three spikes, at $5250, $5500, and $6000. These correspond
to customers who keep the same carrier all 10 years; the large probabilities along
the main diagonal of the transition matrix indicate reasonably strong customer
loyalty. For this particular run, the simulated values of Y had a sample mean and
standard deviation of $5503.40 and $199.32, respectively, from which we can be
95% confident, using the methods of Sect. 5.3, that the true mean $\mu_Y$ is between
$5499.49 and $5507.31.

Fig. 6.15 Histogram of values of Y in Example 6.36 ■
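
Although the full distribution of Y (its spikes and its standard deviation) is most easily explored by simulation, the mean alone can be cross-checked without random numbers, because the expected total cost is the sum of the expected annual costs. The R sketch below is such a check under the same P, prices, and initial distribution as Fig. 6.14; the names v and expected_total are illustrative, and the approach is offered as a sanity check on the simulation rather than as part of the example's method.

```r
P <- matrix(c(.84, .06, .10,
              .08, .82, .10,
              .10, .04, .86), nrow = 3, byrow = TRUE)
prices <- c(550, 600, 525)
v <- c(.4, .2, .4)                   # distribution of last year's carrier (X0)

expected_total <- 0
for (n in 1:10) {
  v <- as.vector(v %*% P)            # distribution of the carrier in year n
  expected_total <- expected_total + sum(v * prices)   # add E(cost in year n)
}
expected_total
```

The value of expected_total should land close to the simulated sample mean of Y, which gives some reassurance that the simulation code is behaving as intended.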
Example 6.37 Our taxi driver now makes one last appearance (hopefully to
applause). He starts each morning at home in zone 3. Methods from Sects. 6.4
and 6.5 allow us to determine the expected number of fares required for him to
return home, or to reach one of the other two zones. But how long does it take him,
in the typical day, to visit all three zones? Let

    $T_{\text{all}}$ = number of transitions required to visit every state at least once
    (not counting the initial state, $X_0$)

To simulate $T_{\text{all}}$, our program must keep track of which states have been visited
thus far. Once all states/zones have been visited, the numerical value of $T_{\text{all}}$ for that
simulation run can be recorded. Figure 6.16 shows appropriate code.


(a) Matlab

    P=[.3 .2 .5; .1 .8 .1; .4 .4 .2];
    Tall=zeros(10000,1);
    for i=1:10000
        current=3;
        visits=[0 0 0];                   % indicator of zones visited so far
        Talltemp=0;
        while (sum(visits)<3)
            nextstate=randsample(1:3,1,true,P(current,:));
            visits(nextstate)=1;
            current=nextstate;
            Talltemp=Talltemp+1;
        end
        Tall(i)=Talltemp;
    end

(b) R

    P <- matrix(c(.3,.2,.5,.1,.8,.1,
        .4,.4,.2),nrow=3,ncol=3,byrow=TRUE)
    Tall <- NULL
    for (i in 1:10000) {
        current <- 3
        visits <- c(0,0,0)                # indicator of zones visited so far
        Talltemp <- 0
        while (sum(visits)<3) {
            nextstate <- sample(1:3,1,TRUE,P[current,])
            visits[nextstate] <- 1
            current <- nextstate
            Talltemp <- Talltemp+1
        }
        Tall[i] <- Talltemp
    }

Fig. 6.16 Code for Example 6.37: (a) Matlab; (b) R


In both programs, a vector called visits keeps a record of which states the
chain visits within that particular run. When state j is visited (j = 1, 2, 3), the jth
entry of visits switches from 0 to 1. Once all three entries of visits equal 1, as
detected by its sum, the while loop terminates and the temporary count of
transitions (Talltemp) is stored in Tall. The result of the program is 10,000
simulated values of the rv T_all, stored in the vector Tall.

Figure 6.17 displays a histogram of the 10,000 values resulting from running the
Matlab program in Fig. 6.16a. The sample mean and standard deviation of these
10,000 values were x̄ = 8.1674 and s = 5.8423. Hence, we estimate the average
number of fares required for the taxi driver to visit all three zones to be 8.1674, with
an estimated standard error of s/√n = 5.8423/√10,000 = 0.058423. Using the
techniques of Chap. 5, we can say with 95% confidence that μ_all, the true mean of
T_all, lies in the interval

\[ \bar{x} \pm 1.96\,\frac{s}{\sqrt{n}} = 8.1674 \pm 1.96(0.058423) = (8.053,\ 8.282) \]

Fig. 6.17 Simulation distribution of the rv T_all

Among the 10,000 simulated values of T_all, 4204 were at most 5 (so, 3 or 4 or 5).
Hence, the estimated probability that the taxi driver visits all three zones within his
first five fares is

\[ \hat{p} = \hat{P}(T_{\text{all}} \le 5) = \frac{4204}{10{,}000} = .4204 \]

The estimated standard error of this estimate is √(p̂(1 − p̂)/n) = .0049.
Hence we are 95% confident that the true probability P(T_all ≤ 5) lies in the interval
.4204 ± 1.96(.0049) = (.4108, .4300).
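
The simulation and both interval calculations of Example 6.37 can be collected into one short R script. The following is only a sketch, not the authors' program: the function name simulate_cover_time and the other variable names are hypothetical, and the seed is set purely so that a run can be repeated. Because the random draws differ from those behind the figures quoted above, the estimates should come out close to, but not exactly equal to, 8.1674 and .4204.

```r
# Sketch: simulate T_all for a chain with transition matrix P, then form
# large-sample 95% confidence intervals for E(T_all) and P(T_all <= 5).
simulate_cover_time <- function(P, start, nruns = 10000) {
  k <- nrow(P)
  Tall <- numeric(nruns)
  for (i in seq_len(nruns)) {
    current <- start
    visited <- rep(FALSE, k)   # the initial state is not counted as visited
    steps <- 0
    while (!all(visited)) {
      current <- sample(1:k, 1, prob = P[current, ])
      visited[current] <- TRUE
      steps <- steps + 1
    }
    Tall[i] <- steps
  }
  Tall
}

# Taxi driver's transition matrix, as in Fig. 6.16
P <- matrix(c(.3, .2, .5,
              .1, .8, .1,
              .4, .4, .2), nrow = 3, byrow = TRUE)

set.seed(1)                    # arbitrary seed, for repeatability only
Tall <- simulate_cover_time(P, start = 3)
n <- length(Tall)

# 95% CI for the mean: xbar +/- 1.96 * s / sqrt(n)
xbar <- mean(Tall)
se_mean <- sd(Tall) / sqrt(n)
ci_mean <- xbar + c(-1, 1) * 1.96 * se_mean

# 95% CI for the proportion: phat +/- 1.96 * sqrt(phat * (1 - phat) / n)
phat <- mean(Tall <= 5)
se_prop <- sqrt(phat * (1 - phat) / n)
ci_prop <- phat + c(-1, 1) * 1.96 * se_prop

print(ci_mean)
print(ci_prop)
```

Passing the matrix and starting state as arguments means the same function could be reused for other chains, such as those in the exercises that follow.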


6.6.1 Exercises: Section 6.6 (59-66) 


59. Refer back to Exercise 3. Suppose this machine produces 150 units on days 
when it is fully operational, 75 units per day when partially operational, and 
0 units when broken. Consider a month with 20 work days, and assume the 
machine ended the previous month fully operational. 

(a) Write a simulation of the rv Y = the number of units produced by this
machine in 20 work days.

(b) Create a histogram of simulated values of Y for at least 10,000 
simulation runs. 

(c) Construct a 95% confidence interval for the mean number of units produced 
by this machine across 20 work days. 

(d) Construct a 95% confidence interval for the probability that the machine
produces at least 2000 units in such a month.

60. Four friends A, B, C, and D are notorious for sharing rumors amongst themselves.
Being very gossipy but not particularly bright, each friend is equally likely to share
a rumor with any of the other three friends, even if that friend has already heard
it. (For example, if friend B most recently heard the rumor, each of friends A, C,
and D is equally likely to hear it next, regardless of how B came to hear the
rumor!) Let X_n = the nth person in this foursome to hear a particular rumor.

(a) Construct the one-step transition matrix for this Markov chain.

(b) Friend A has just overheard a particularly nasty rumor about a classmate
and is eager to share it with the other three friends. Let T equal the number
of times the rumor is repeated within the foursome until all of them have
heard the rumor. Write a program to simulate T, and use your program to
estimate E(T).

61. A state lottery official has proposed the following system for a new game. In the
first week of a new year, a $10 million prize is available. If nobody gets the
winning lottery numbers correct and wins the prize that week, the value doubles
to $20 million for the second week; otherwise, the prize for the second week is
also $10 million. Each week, the prize value doubles if nobody wins it and
returns to $10 million otherwise. Suppose that there is a 40% chance that
someone in the state wins the lottery prize each week, irrespective of the
current value of the prize. Let X_n = the value of the lottery prize in the nth
week of the year.

(a) Determine the one-step transition probabilities for this chain. [Hint: Given
the value of X_n, X_{n+1} can only be one of two possible dollar amounts.]

(b) Let M be the maximum value the lottery prize achieves over the course of a
52-week year. Simulate at least 10,000 values of the rv M, and report the
sample mean and SD of these simulated values. [Hint: Given the large state
space of this Markov chain, don’t attempt to construct the transition matrix.
Instead, code the probabilities in (a) directly.]

(c) Let Y be the total amount paid out by the lottery in a 52-week year.
Simulate at least 10,000 values of the rv Y, and report a 95% confidence
interval for E(Y).

(d) Repeat (c), but now assume the probability of a winner is .7 each week
rather than .4. Should the lottery commission make it easier or harder for
someone to win each week? Explain.

62. Write a Markov chain simulation program with the following specifications.
The inputs should be the transition matrix P, an initial state x_0, and the number
of steps n. The output should be a single realization of X_1, X_2, ..., X_n as either a
row vector or a column vector.

63. Refer back to Exercise 12. Suppose that the typical annual premium for a
category 1 (safest) customer is $500; for category 2, $600; for category
3, $1000; and for category 4 (riskiest driver), $1500.

(a) Use a Markov chain simulation to estimate the distribution of the rv
Y_1 = total premium paid by a customer over 10 years with the insurance
company, assuming s/he starts in category 1. Create a histogram of values
for Y_1, and construct a 95% confidence interval for E(Y_1).

(b) Repeat (a) assuming instead that the customer starts as a category 3 driver.

64. Write a simulation program for Gambler’s Ruin. The inputs should be
a = Allan’s initial stake, b = Beth’s initial stake (so a + b is the total stake),
p = the probability Allan defeats Beth in any single game, and N = the number
of tournaments to be simulated. The program should output two N-by-1
vectors: one recording the number of games played for each of the N runs,
and one indicating who won each time. Use your program to determine (a) the
average tournament length and (b) the probability Allan eventually wins for the
settings a = b = $5 and p = .4. Give 95% confidence intervals for both answers.

65. Example 6.3 describes a (one-dimensional) random walk. This is sometimes
called a simple random walk.

(a) Write a program to simulate the first 100 steps of a random walk starting at
X_0 = 0. [Hint: If X_n = s, then X_{n+1} = s ± 1 with probability 1/2 each.]

(b) Run your program in (a) 10,000 times, and use the results to estimate the
probability that a random walk returns to its origin at any time within
the first 100 steps.

(c) Let R_0 = the number of returns to the origin in the first 100 steps of the
random walk, not counting its initial state. Use your simulation to (1) create
a histogram of simulated values of R_0 and (2) construct a 95% confidence
interval for E(R_0).

66. A two-dimensional random walk is a model for movement along the integer
lattice in the xy-plane, i.e., points (x, y) where x and y are both integers. The
“walk” begins at X_0 = (0, 0). At each time step, a move is made one unit left or
right (probability 1/2 each) and, independently, one unit up or down (also
equally likely). If X_n = the (x, y)-coordinates of the chain after n steps, then
X_n is a Markov chain.

(a) Write a program to simulate the first 100 steps of a two-dimensional
random walk. [Hint: The x- and y-coordinates of a two-dimensional random
walk are each simple random walks. Since they are independent, the x- and
y-movements can be simulated separately.]

(b) Use your program in (a) to estimate the probability that a two-dimensional
random walk returns to its origin within the first 100 steps. Use at least
10,000 runs.

(c) Use your program in (a) to estimate E(R_0), where R_0 = the number of times
the walk returns to (0, 0) in the first 100 steps.


6.7 Supplementary Exercises (67-82)


67. A hamster is placed into the six-chambered circular habitat shown in the
accompanying figure. Sitting in any chamber, the hamster is equally likely to
next visit either of the two adjacent chambers. Let X_n = the nth chamber visited
by the hamster.

(a) Construct the one-step transition matrix for this Markov chain.

(b) Is this a regular Markov chain? 

(c) Intuitively, what should the stationary probabilities of this chain be? Verify 
these are indeed its stationary probabilities. 

(d) Given that the hamster is currently in chamber 3, what is the expected 
number of transitions it will make until it returns to chamber 3? 

(e) Given that the hamster is currently in chamber 3, what is the expected 
number of transitions it will make until it arrives in chamber 6? 

68. Teenager Mike wants to borrow the car. He can ask either parent for permission
to take the car. If he asks his mom, there is a 20% chance she will say “yes,” a
30% chance she will say “no,” and a 50% chance she will say, “ask your
father.” Similarly, the chances of hearing “yes”/“no”/“ask your mother” from
his dad are .1, .2, and .7, respectively. Imagine Mike’s efforts can be modeled
as a Markov chain with states (1) talk to Mom, (2) talk to Dad, (3) get the car
(“yes”), (4) strike out (“no”). Assume that once either parent has said “yes” or
“no,” Mike’s begging is done.

(a) Construct the one-step transition matrix for this Markov chain. 

(b) Identify the absorbing state(s) of the chain. 

(c) Determine the mean times to absorption. 

(d) Determine the probability that Mike will eventually get the car if (1) he 
asks Mom first and (2) he asks Dad first. Whom should he ask first? 

69. Refer back to Exercise 14. Suppose Lucas starts in room 1 and proceeds as
described in that exercise; however, his mean-spirited uncle has snuck out of
the house entirely, leaving Lucas to search interminably. So, in particular, if
Lucas enters room 6 of the house, his next visit will necessarily be to room
5. (This really happened one summer!)

(a) Determine the transition matrix for this chain. 

(b) Verify that this Markov chain is regular. 

(c) Determine the steady-state probabilities of this chain. 

(d) What proportion of time in the long run does Lucas spend in room 2? 

(e) What is the average number of room transitions between Lucas’ visits to 
room 1? 

70. Refer back to Exercises 20 and 21.

(a) Suppose all four vans were operational as of Monday morning. What is the
expected backlog (that is, the expected number of vans needing repair) as
of Friday evening?

(b) Suppose instead that two of the four vans were down for repairs Monday
morning. Now what is the expected backlog as of Friday evening?

71. Five Mercedes E550 vehicles are shipped to a local dealership. The dealer sells
one E550 in any week with probability .3 and otherwise sells none in that week.
When all E550s in stock have been sold, the dealer requests a new shipment of
five such cars, and it takes 1 week for that delivery to occur. Let X_n = the
number of Mercedes E550s at this dealership n weeks after the initial delivery
of five cars.

(a) Construct the transition matrix for this chain. [Hint: The states are 0, 1, 2, 3,
4, 5.]

(b) Determine the steady-state probabilities for this chain.

(c) On the average, how many weeks separate successive orders of five E550s?

72. Refer back to the previous exercise. Let m = the number of Mercedes E550s
delivered to the dealership at one time (both initially and subsequently), and let
p = the probability an E550 is sold in any particular week (m = 5 and p = .3 in
the previous exercise). Determine the steady-state probabilities for this chain
and then the average number of weeks between vehicle orders.
73. Sports teams can have long streaks of winning (or losing) seasons, but occasionally
a team’s fortunes change quickly. Suppose that each team in the
population of all college football teams can be classified as (1) weak,
(2) medium, or (3) strong, and that the following one-step transition
probabilities apply to the Markov chain X_n = a team’s strength n seasons
from now:

\[ P = \begin{bmatrix} .8 & .2 & 0 \\ .2 & .6 & .2 \\ .1 & .2 & .7 \end{bmatrix} \]

(a) If a college football team is weak this season, what is the minimum number
of seasons required for it to become strong?

(b) If a team is strong this season, what is the probability it will also be strong 
four seasons from now? 

(c) What is the average number of seasons that must pass for a weak team to 
become a strong team? 

(d) What is the average number of seasons that must pass for a strong team to 
become a weak team? 

74. Jay and Carol enjoy playing tennis against each other. Suppose we begin
watching them when they are at deuce. This means the next person to win a
point earns advantage. If that same person scores the next point, then s/he wins
the game; otherwise, the game returns to deuce.

(a) Construct a transition matrix to describe the status of the game after
n points have been scored (starting at deuce). [Hint: There are five states:
(1) Jay wins, (2) advantage Jay, (3) deuce, (4) advantage Carol, (5) Carol
wins.]

(b) Suppose Carol is somewhat better than Jay and has a 60% chance of
winning any particular point. Determine (1) the probability Carol eventually
wins and (2) the expected number of points to be played, starting at
deuce. [Hint: This should bear surprising similarity to a game played earlier
in the chapter by Allan and Beth!]
75. The authors of the article “Pavement Performance Modeling Using Markov
Chain” (Proc. ISEUSAM, 2012: 619-627) developed a system for classifying
pavement segments into five categories: (1) Very good, (2) Good, (3) Fair,
(4) Bad, and (5) Very bad. Analysis of pavement samples led to the construction
of the following transition matrix for the Markov chain X_n = pavement
condition n years from now:

\[ \begin{bmatrix} .958 & .042 & 0 & 0 & 0 \\ 0 & .625 & .375 & 0 & 0 \\ 0 & 0 & .797 & .203 & 0 \\ 0 & 0 & 0 & .766 & .234 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} \]

Notice that a pavement segment either maintains its condition or goes down by
one category each year.

(a) The evaluation of one particular stretch of road led to the following initial 
probability vector (what the authors call a “condition matrix”): [.3307 
.2677 .2205 .1260 .0551]. Use the Markov chain model to determine the 
condition matrix of this same road section 1, 2, and 3 years from now. 

(b) “Very bad” road segments require repairs before they are again usable; the 
authors’ model applies to unrepaired road. What is the average time 
(number of years) that a very good road can be used before it degrades 
into very bad condition? Make the same determination for good, fair, and 
bad roads. 

(c) Suppose one road segment is randomly selected from the area to which the 
condition matrix in (a) applies. What is the expected amount of time until 
this road segment becomes very bad? [Hint: Use the results of part (b).] 

76. A constructive memory agent (CMA) is an autonomous software unit that uses
its interactions not only to change its data (“memory”) but also its fundamental
indexing systems for that data (“structure”). The article “Understanding
Behaviors of a Constructive Memory Agent: A Markov Chain Analysis”
(Knowledge-Based Systems, 2009: 610-621) describes a study of one such
CMA as it moved between nine different stages of learning. (Stage 1 is
sensation and perception; later stages add on other behaviors such as
hypothesizing, neural network activation, and validation. Consult the article
for details.) The accompanying state diagram mirrors the one given in the
article for the authors’ first experiment.

(a) Construct the transition matrix for this chain.

(b) What are the absorbing states of the chain? 

(c) All CMA processes begin in stage 1. What is the mean time to absorption 
for such a process? Here, “time” refers to the number of transitions from 
one learning stage to another. [Note: In this particular experiment, absorbing
states correspond to any instance of so-called “inductive” learning.]

(d) Starting in stage 1, what is the probability a CMA will end the experiment 
in state 8 (constructive learning plus inductive learning)? 

77. The authors of the article “Stationarity of the Transition Probabilities in the
Markov Chain Formulation of Owner Payments on Projects” (ANZIAM J.,
v. 53, 2012: C69-C89) studied payment delays in road construction in
Australia. States for any payment were defined as follows: k weeks late for
k = 0, 1, 2, 3; paid (pd), an absorbing state; and “to be resolved” (tbr), meaning
the payment was at least 1 month late, which the authors treated as another
absorbing state. For one particular project, the following Q and R matrices
were given for the canonical form of the one-step transition matrix:


ofo 1 0 0 : . 

ga [). eo 2s 2 | Ra | OE 
210 0 0. .897 

ce ae 2/.013 0 

3 | 804 .196 


(a) Construct the complete 6 × 6 transition matrix P for this Markov chain.

(b) Draw the state diagram of this Markov chain.

(c) Determine the mean time to absorption for a payment that is about to come
due (i.e., one that is presently 0 weeks late).

(d) What is the probability a payment is eventually made, as opposed to being
classified as “to be resolved”?

(e) Consider the two probabilities P(0 → 1) and P(3 → pd). What is odd about
each of these values? (The authors of the article offer no explanation for the
irregularity of these two particular probabilities.)

78. In a nonhomogeneous Markov chain, the conditional distribution of X_{n+1}
depends on both the previous state X_n and the current time index n. As an
example, consider the following method for randomly assigning subjects one at
a time to either of two treatment groups, A or B. If n patients have been
assigned a group so far, and a of them have been assigned to treatment A, the
probability the next patient is assigned to treatment group A is

\[ P\big((n+1)\text{st patient assigned to A} \mid a \text{ out of first } n \text{ in A}\big) = \frac{n - a + 1}{n + 2} \]

Hence, the first patient is assigned to A with probability
(0 − 0 + 1)/(0 + 2) = 1/2; if the first patient was assigned to A, then the second
patient is also assigned to A with probability (1 − 1 + 1)/(1 + 2) = 1/3. This
assignment protocol ensures that each next patient is more likely to be assigned
to the smaller group. Let X_n = the number of patients in treatment group A after
n total patients have been assigned (X_0 = 0). To simplify matters, assume there
are only 4 patients in total to be randomly assigned.

(a) Let P_1 denote the transition matrix from n = 0 to n = 1. Assume the state
space of the chain is {0, 1, 2, 3, 4}. Construct P_1. [Hint: Since X_0 = 0, only
the first row of P_1 is really relevant. To make this a valid transition matrix,
treat the “impossible” states 1, 2, 3, and 4 as absorbing states.]

(b) Construct P_2, the transition matrix from n = 1 to n = 2. Use the same hint as
above for states 2, 3, and 4, which are impossible at time n = 1.

(c) Following the pattern of (a) and (b), construct the matrices P_3 and P_4.

(d) For a nonhomogeneous chain, the multistep transition probabilities can be
calculated by multiplying the aforementioned matrices from left to right,
e.g., the 4-step transition matrix for this chain is P_1 P_2 P_3 P_4. Calculate this
matrix, and then use its first row to determine the likelihoods of 0, 1, 2, 3,
and 4 patients being randomly assigned to treatment group A using this
method.

[Note: Random assignment strategies of this type were originally investigated in
the article “Forcing a Sequential Experiment to be Balanced,” Biometrika
(1971): 403-417.]

79. A communication channel consists of five relays through which all messages
must pass. Suppose that bit switching errors of either kind (0 to 1, or 1 to 0)
occur with probability .02 at the first relay. The corresponding probabilities for
the other four relays are .03, .02, .01, and .01, respectively. If we define
X_n = the parity of a bit after traversing the nth relay, then X_0, X_1, ..., X_5 forms a
nonhomogeneous Markov chain.

(a) Determine the one-step transition matrices P_1, P_2, P_3, P_4, and P_5.

(b) What is the probability that a 0 bit entering the communication relay
system also exits as a 0 bit? [Hint: Refer back to the previous exercise
for information on nonhomogeneous Markov chains.]

80. Consider the two-state Markov chain described in Exercise 39, whose one-step
transition matrix is given by

\[ \begin{array}{c|cc} & 0 & 1 \\ \hline 0 & 1-\alpha & \alpha \\ 1 & \beta & 1-\beta \end{array} \]

for some 0 < α, β < 1. Use mathematical induction to show that the k-step
transition probabilities are given by

\[ P^{(k)}(0 \to 0) = \delta^k + (1-\pi)(1-\delta^k) \qquad P^{(k)}(0 \to 1) = \pi(1-\delta^k) \]
\[ P^{(k)}(1 \to 0) = (1-\pi)(1-\delta^k) \qquad P^{(k)}(1 \to 1) = \delta^k + \pi(1-\delta^k) \]

where π = α/(α + β) and δ = 1 − α − β. [Note: Applications of these multistep
probabilities are discussed in “Epigenetic Inheritance and the Missing Heritability
Problem,” Genetics, July 2009: 845-850.]

81. A 2012 report (“A Markov Chain Model of Land Use Change in the Twin 
Cities, 1958-2005,” available online) provided a detailed analysis from maps of 
Minneapolis-St. Paul, MN over the past half-century. The Twin Cities area was 
divided into 610,988 “cells,” and each cell was classified into one of ten 
categories: (1) airports, (2) commercial, (3) highway, (4) industrial, (5) parks, 
(6) public, (7) railroads, (8) residential, (9) vacant, (10) water. The report’s 
authors found that X_n = classification of a randomly selected cell was well
modeled by a time-homogeneous Markov chain when a time increment of
about 8 years is employed. The accompanying matrix shows the one-step
transition probabilities from 1997 (n) to 2005 (n + 1); rows and columns are
in the same order as the sequence of states described above.

\[ \begin{bmatrix}
.7388 & .0010 & .0068 & .0010 & .0325 & .0131 & .0000 & .0055 & .1984 & .0029 \\
.0001 & .8186 & .0201 & .0560 & .0045 & .0227 & .0002 & .0413 & .0350 & .0015 \\
.0004 & .0107 & .9544 & .0054 & .0058 & .0031 & .0002 & .0094 & .0105 & .0001 \\
.0004 & .0710 & .0099 & .8371 & .0082 & .0086 & .0011 & .0106 & .0517 & .0014 \\
.0022 & .0036 & .0031 & .0025 & .9128 & .0062 & .0002 & .0116 & .0364 & .0214 \\
.0001 & .0193 & .0100 & .0384 & .0569 & .7364 & .0004 & .0223 & .1091 & .0071 \\
.0000 & .0065 & .0142 & .0201 & .0110 & .0032 & .9139 & .0168 & .0130 & .0013 \\
.0000 & .0024 & .0024 & .0009 & .0041 & .0023 & .0002 & .9634 & .0230 & .0013 \\
.0004 & .0141 & .0099 & .0156 & .0513 & .0057 & .0002 & .0988 & .7920 & .0120 \\
.0001 & .0010 & .0003 & .0014 & .0136 & .0001 & .0000 & .0055 & .0096 & .9684
\end{bmatrix} \]

(a) In 2005, the distribution of cell categories (out of the 610,988 total cells)
was as follows: 

[4047 20,296 16,635 24,503 74,251 18,820 1505 195,934 200,837 54,160] 

The order of the counts matches the category order above, e.g., 4047 
cells were part of airports and 54,160 cells were located on water. Use the 
transition probabilities to predict the land use distribution of the Twin 
Cities region in 2013. 

(b) Determine the predicted land use distribution for the years 2021 and 2029 
(remember, each time step of the Markov chain is 8 years). Then determine 
the percent change from 1995 to 2029 in each of the ten categories (similar 
computations were made in the cited report). 

(c) Though it’s unlikely that land use evolution will remain the same forever, 
imagine that the one-step probabilities can be applied in perpetuity. What is 
the projected long-run land use distribution in Minneapolis-St. Paul? 

82. In the article “Reaching a Consensus” (J. Amer. Stat. Assoc., 1974: 118-121),
Morris DeGroot considers the following situation: s statisticians must reach
an agreement about an unknown population distribution, F. (The same method,
he argues, could be applied to opinions about the numerical value of a parameter,
as well as many nonstatistical scenarios.) Let F_{10}, ..., F_{s0} represent
their initial opinions. Each statistician then revises his belief about F as
follows: the ith individual assigns a “weight” p_{ij} to the opinion of the jth
statistician (j = 1, ..., s), where p_{ij} ≥ 0 and p_{i1} + ··· + p_{is} = 1. He then updates
his own belief about F to

\[ F_{i1} = p_{i1}F_{10} + \cdots + p_{is}F_{s0} \]

This updating is performed simultaneously by all s statisticians (so, i = 1,
..., s).

(a) Let F_0 = (F_{10}, ..., F_{s0})′, and let P be the s × s matrix with entries p_{ij}. Show
that the vector of updated opinions F_1 = (F_{11}, ..., F_{s1})′ is given by
F_1 = PF_0.

(b) DeGroot assumes that updates to the statisticians’ beliefs continue iteratively,
but that the weights do not change over time (so, P remains the
same). Let F_n denote the vector of opinions after n updates. Show that
F_n = P^n F_0.

(c) The group is said to reach a consensus if the limit of F_n exists as n → ∞
and each entry of that limit vector is the same (so all individuals’ opinions
converge toward the same belief). What would be a sufficient condition on
the weights in P for the group to reach a consensus?

(d) DeGroot specifically considers four possible weight matrices:

Discuss what each one indicates about the statisticians’ views on each other,
and determine for which matrices the group ultimately reaches a consensus. If a
consensus is reached, write out the consensus “answer” as a linear combination
of F_{10}, ..., F_{s0}.


Random Processes 


In Chap. 1, we introduced the concept of a random event: a collection of one or more 
outcomes resulting from a random experiment (e.g., a randomly selected device 
works for 1000 h, or a randomly selected person has brown hair). In Chaps. 2-4, we 
studied random variables: numerical values resulting from random experiments (the 
number of flaws on a randomly selected wafer, the number of wins in 5 games of 
chance, the mass of a randomly selected object). In this chapter, we look at random 
processes, also called stochastic processes (“stochastic” is a synonym for
“random”): time-dependent functions resulting from random phenomena.

For example, consider modeling the number of people logged into a particular 
server over the course of the day. Since the exact times at which individuals log in 
are generally unpredictable, we might reasonably apply a model which treats logins 
as “random.” In particular, at any specific, fixed point in time—say, noon—we can 
model the number of people logged in by an appropriate (discrete) random variable. 
The new concept in Chap. 7 is to model the evolution of that random count over 
time. This gives us two dimensions of interest: the random variable itself (here, the 
count of logins) and time. 

Among the most common applications of random processes in engineering is 
that of random noise, a term for the disparity between what a received signal should 
“ideally” look like and what actually arrives at the receiver. Our ability to accu- 
rately model this distortion or noise will enable us to filter out some (hopefully 
large) proportion of that noise, thereby recovering a cleaner signal. 

In Sect. 7.1, we look at classifications of random processes according to whether 
the variable dimension and/or the time dimension are modeled as discrete or continu- 
ous. In Sect. 7.2, we connect previous ideas of mean, standard deviation, and so on to 
this new world of random processes. Section 7.3 introduces the concept of a station- 
ary random process and the special class of wide-sense stationary processes; these 
will be the backbone of signal processing in Chap. 8. Sections 7.4—7.7 consider 
several specific classes of random processes: discrete-time, Poisson, Gaussian, and 
continuous-time Markov. 


M.A. Carlton and J.L. Devore, Probability with Applications in Engineering, Science, and
Technology, Springer Texts in Statistics, DOI 10.1007/978-1-4939-0395-5_7,

7.1 Types of Random Processes


In Chap. 2, we defined a random variable as a rule that associates a number with 
each outcome in the sample space of some experiment. For example, we may 
associate with each outcome of the experiment of rolling two dice an integer 
X between 2 and 12 indicating the sum of the two up-facing sides. Any single 
realization of this experiment results in a specific number, a sample value of X. We 
define random processes analogously. 


DEFINITION 

For a given sample space S of some experiment, a random process is any rule
that associates a time-dependent function with each outcome in S. Any such
function that may result is a sample function of the random process. The
collection of all possible sample functions is called the ensemble of the
random process.


Figure 7.1 illustrates this definition. Analogous to our notation for random
variables, we will denote a (continuous-time) random process by X(t), while the
lower-case x(t) indicates a particular sample function.




Fig. 7.1. A random process 


Example 7.1 Some communication systems use phase-shift keying to transmit 
information. A quaternary phase-shift keying (QPSK) system can transmit four 
distinct symbols (often used to encode two bits at a time: 00, 01, 10, 11). The four 
symbols are distinguished by varying the phase at which they are transmitted; 
specifically, for k = 1, 2, 3, 4, the kth symbol is transmitted for T seconds with the
wave

\[ x_k(t) = \cos(2\pi f_0 t + \pi/4 + k\pi/2), \quad 0 \le t \le T \qquad (7.1) \]

for some predetermined frequency f_0. If we consider the transmission of a single
randomly selected symbol, we may let X(t) denote the corresponding transmitted
wave. Each function x_k(t) in Expression (7.1) is a sample function; the set of these
four functions comprises the ensemble of X(t) and is displayed in Fig. 7.2 for
0 ≤ t ≤ 4.


Fig. 7.2 Ensemble of the QPSK process in Example 7.1
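
To make the idea of an ensemble concrete, the four sample functions of Expression (7.1) can be evaluated on a time grid, and one realization of the process obtained by picking a symbol at random. The following R sketch rests on assumptions not stated in the example: the frequency f0 = 1, the grid spacing, the function name qpsk_wave, and equally likely symbols are all illustrative choices.

```r
# Sketch: the QPSK ensemble of Example 7.1 on 0 <= t <= 4.
f0 <- 1                                   # assumed frequency
t <- seq(0, 4, length.out = 401)          # time grid (arbitrary resolution)

# k-th sample function from Expression (7.1)
qpsk_wave <- function(k, t, f0) cos(2 * pi * f0 * t + pi / 4 + k * pi / 2)

# Each column of 'ensemble' is one of the four sample functions x_k(t)
ensemble <- sapply(1:4, qpsk_wave, t = t, f0 = f0)

# One realization of X(t): choose a symbol at random (assumed equally likely)
set.seed(1)
k <- sample(1:4, 1)
x <- ensemble[, k]

# Plot all four sample functions together
matplot(t, ensemble, type = "l", lty = 1, xlab = "t", ylab = "x_k(t)")
```

The randomness of X(t) lies entirely in which of the four columns is selected. Once k is known, the whole path is determined.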


Example 7.2 Imagine the fluctuation in the value of Apple Inc. stock (symbol: 
AAPL) during the next 8-h trading day, measuring time from the opening bell on 
Wall Street. Since that fluctuation cannot be predicted precisely, we may 
reasonably model the stock’s value by an appropriate random process X(t); the
ensemble of X(t) would be subject to the constraint X(0) = yesterday’s closing
value. Two examples of possible sample functions appear in Fig. 7.3, where we 
have assumed a previous day’s closing value of $580. The ensemble of X(t) consists 
of all possible paths that the price of Apple stock could hypothetically take 
tomorrow, starting at $580 per share. Economists and statisticians use a variety of 
time series models to forecast the behaviors of such random processes.

Fig. 7.3 Two sample functions for a stock price’s fluctuation


Example 7.3 Consider modeling the number of people N(t) logged in to a specific
server at time t (perhaps measured from midnight). Since logins and logouts are
unpredictable, we might reasonably apply a random process model to N(t).
Figure 7.4 shows one possible sample function; notice that, since our variable is
integer-valued, the function “jumps” rather than varying continuously. In this
context, the ensemble of N(t) consists of all nonnegative integer-valued functions
n(t) that might hypothetically arise from successive logins and logouts.
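
A sample function of this kind is easy to generate once some model for logins and logouts is chosen. The R sketch below is purely illustrative, and every modeling choice in it is an assumption rather than part of the example: login times are scattered uniformly over a 24-hour day with a Poisson-distributed total count, session lengths are exponential, nobody is logged in at time 0, and the rate and mean session length are invented values.

```r
# Sketch: one sample function n(t) of a login-count process.
set.seed(1)
horizon <- 24            # hours, measured from midnight
rate <- 5                # hypothetical average number of logins per hour
mean_session <- 1.5      # hypothetical mean session length, in hours

n_logins <- rpois(1, rate * horizon)              # total logins during the day
login <- sort(runif(n_logins, 0, horizon))        # login times
logout <- login + rexp(n_logins, 1 / mean_session)  # matching logout times

# n(t) = (logins so far) - (logouts so far)
n_of_t <- function(t) sum(login <= t) - sum(logout <= t)

grid <- seq(0, horizon, by = 0.01)
n_path <- sapply(grid, n_of_t)

# A step plot shows the integer-valued "jumps"
plot(grid, n_path, type = "s", xlab = "t (hours)", ylab = "n(t)")
```

Rerunning with a different seed produces a different member of the ensemble. Each one is a nonnegative, integer-valued step function.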


Fig. 7.4 A sample function for the random process of Example 7.3

Example 7.4 A dust particle lands on the surface of the water in a glass. For
simplicity’s sake, consider observing the motion of the particle only in the vertical 
direction (relative to our orientation) as time progresses. If we define the particle’s 
initial position as 0, we have a random process Y(t) = the vertical position of the 
particle t seconds after landing on the water. 

Figure 7.5a shows one possible sample function for this particle motion, while 
Fig. 7.5b shows 100 different sample functions and thus approximates the ensemble 
of Y(t). Notice that, as t increases, the particle has greater potential to be farther
away from its origin, the line y = 0, since the particle has had more time to move.
However, the particle will naturally “wiggle,” and so a typical sample function will
return to its origin multiple times, rather than “flying off” away from 0.

This is an example of Brownian motion, a model physicists regularly use for the 
seemingly random motion of electrons and various microscopic particles. Brownian 
motion, in turn, is an example of a Gaussian process; we will study Gaussian 
processes (in particular, Brownian motion) in Sect. 7.6. 


Fig. 7.5 Brownian motion: (a) a single sample function, (b) 100 sample functions


7.1.1 Classification of Processes 


As mentioned in the introduction to this chapter, we can classify random processes 
according to whether the variable and time dimensions are modeled as discrete or 
continuous. We call X(t) a discrete-space process if its set of possible values at any
time t is finite or countably infinite. Otherwise, X(t) is a continuous-space process.
Example 7.1 and Example 7.2 illustrate continuous-space processes, since the
variables (height of the sinusoid, value of the stock) may take on a continuum of
values. In contrast, we have a discrete-space process in Example 7.3, since the only
possible values of N(t) are the countable set {0, 1, 2, ...}. These classifications are
consistent with our usage of the terms discrete and continuous in Chaps. 2 and 3. 

The difference between discrete- and continuous-space processes is less important
than distinguishing how we model time. All of our above examples are
continuous-time processes, because time is measured on a continuous scale,
typically [0, ∞) or [0, T] for some fixed T. In contrast, imagine recording the
value of Apple stock at the end of each day (or the number of people logged into a
server at the end of each hour). Treating the variable as random, we would have a
sequence X_1, X_2, X_3, ..., where X denotes the value of the variable and the index
n corresponds to the nth instance of measuring the process. The listing X_1, X_2, ..., or
more simply X_n, is a discrete-time random process, also called a random
sequence. We already saw a special type of random sequence, Markov chains, in 
Chap. 6; we will consider general discrete-time processes more carefully in 
Sect. 7.4. Throughout the rest of this chapter as well as Chap. 8, the term “random 
process” will always refer to a continuous-time process unless indicated otherwise. 


7.1.2 Random Processes Regarded as Random Variables


In most of the figures in this section, you will notice we have displayed time, t, on
the horizontal axis, while the “random” behavior is illustrated in the vertical 
direction. You may find it helpful to think of these as the “time direction” and 
“random direction,” respectively. To model a random process, we must truly 
understand its behavior in the “random direction.” Toward that understanding, 
consider Fig. 7.5b, which shows the ensemble of a Brownian motion process. Fix 
a time point—say, t= 1. Looking in the vertical direction, we have a collection of 
“heights” corresponding to the numerical values of the many sample functions y(t) 
displayed in the figure evaluated at t=1. These many values of y(1) form a 
probability distribution in the vertical direction: they show possible values of 
Y(1), and the underlying random experiment that generated these sample functions 
determines the relative likelihoods of those values. It is in that sense that the vertical 
axis of our graphs is the “random direction.” 

More simply (and perhaps more usefully) put, we make the following observation:
At any fixed time point t_0, the ensemble of a random process X(t) forms a
probability distribution; that is, X(t_0) is a random variable.


Example 7.5 An intended signal may have the form v_0 + a cos(ω_0 t + θ_0), but amplitude
variation may occur (due to natural current or voltage variation). We can
define a random process by


$$X(t) = v_0 + A\cos(\omega_0 t + \theta_0)$$


where A is a random variable whose distribution describes the amplitude variation.
Figure 7.6 illustrates part of the ensemble of X(t) when the model for amplitude
variation is a uniform distribution on [−1, 1], for the specifications v_0 = 0, ω_0 = 2π,
and θ_0 = 0. That is, X(t) = A cos(2πt) with A ~ Unif[−1, 1].

Fig. 7.6 The ensemble of X(t) = A cos(2πt): (a) three sample functions; (b) hundreds of sample
functions


At time t = 0 (the far left edge of the graph), we may write X(0) = A cos(2π · 0) =
A cos(0) = A. Since A ~ Unif[−1, 1] and X(0) = A, clearly X(0) ~ Unif[−1, 1]. That
matches what our eyes see in the graph at t = 0: the values in the vertical direction
seem “evenly” distributed on [−1, 1]. This same distribution can be seen at t = 0.5,
1, 1.5, 2, ....

In contrast, X(1/3) = A cos(2π/3) = −.5A ~ Unif[−.5, .5]. This, too, is visible in
the graph: at t = 1/3 ≈ .33, the vertical expanse of the graph is not from −1 to 1 but
rather from −.5 to .5.

Finally, at t = 1.75 we have X(1.75) = A cos(7π/2) = A(0) = 0, i.e., X(1.75) equals
0 with probability 1 (i.e., for every member of the ensemble). We see in the graph
that all functions of the form x(t) = a cos(2πt) indeed equal 0 at t = 1.75 (as well as at
t = 0.25, 0.75, and 1.25). ■
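
The three fixed-time distributions in Example 7.5 can be checked by simulation. The following is a minimal sketch and not part of the original example: the function name `sample_amplitude_process`, the number of simulated paths, and the seed are illustrative assumptions.

```python
import math
import random


def sample_amplitude_process(t, n_paths=10_000, seed=0):
    """Return n_paths simulated values of X(t) = A*cos(2*pi*t), with A ~ Unif[-1, 1].

    Each draw of A corresponds to one member of the ensemble; t is held fixed,
    so the returned list approximates the distribution in the "random direction".
    """
    rng = random.Random(seed)
    factor = math.cos(2 * math.pi * t)
    return [rng.uniform(-1.0, 1.0) * factor for _ in range(n_paths)]


if __name__ == "__main__":
    for t in (0.0, 1 / 3, 1.75):
        values = sample_amplitude_process(t)
        print(f"t = {t:.2f}: min = {min(values):+.3f}, max = {max(values):+.3f}")
```

Under these assumptions the simulated values should span roughly [−1, 1] at t = 0, roughly [−.5, .5] at t = 1/3, and collapse to 0 (up to floating-point rounding) at t = 1.75.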


Example 7.6 (Example 7.4 continued) A Brownian motion process Y(t) is partially
characterized by the fact that, at any time t, Y(t) has a Gaussian (i.e., normal)
distribution with mean 0 and variance αt, for some constant α. In Fig. 7.5, we used
the parameter α = 1 to generate the graph. Thus, in Fig. 7.5b the probability
distribution displayed in the vertical direction at time t = 1 is Gaussian with a
mean of 0 and a variance of (1)(1) = 1, i.e., a standard normal distribution. In
contrast, looking at time t = 9, Y(9) is also Gaussian with mean zero but with
standard deviation √(αt) = √((1)(9)) = 3, i.e., Y(9) ~ N(0, 3), where the second
parameter is the standard deviation. The increase in the
variability of the ensemble as t increases is apparent in Fig. 7.5b. The Gaussian
nature of the model is reflected by the fact that we see a greater concentration of
values nearer the y = 0 line and a sparser set of values far from that midline. ■
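
The growth of the vertical spread described in Example 7.6 can be reproduced numerically. The sketch below is illustrative only. It assumes the standard construction of Brownian motion from independent Gaussian increments, which this section has not yet introduced, and all function names, the step size, and the seed are hypothetical choices.

```python
import math
import random
import statistics


def brownian_path(alpha, t_end, n_steps, rng):
    """Simulate one sample function on [0, t_end] at n_steps + 1 grid points.

    Assumption (not stated in this section): successive increments are
    independent Gaussians with mean 0 and variance alpha * dt.
    """
    dt = t_end / n_steps
    step_sd = math.sqrt(alpha * dt)
    position = 0.0  # the particle starts at its origin
    path = [position]
    for _ in range(n_steps):
        position += rng.gauss(0.0, step_sd)
        path.append(position)
    return path


def simulate_ensemble(alpha=1.0, t_end=9.0, n_steps=90, n_paths=2000, seed=0):
    """Return a list of n_paths independent sample functions."""
    rng = random.Random(seed)
    return [brownian_path(alpha, t_end, n_steps, rng) for _ in range(n_paths)]


def variance_at(paths, index):
    """Sample variance across the ensemble at one grid index."""
    return statistics.variance([path[index] for path in paths])


if __name__ == "__main__":
    paths = simulate_ensemble()
    # With t_end = 9 and 90 steps, index 10 is t = 1 and index 90 is t = 9.
    print("sample variance at t = 1:", round(variance_at(paths, 10), 3))
    print("sample variance at t = 9:", round(variance_at(paths, 90), 3))
```

With α = 1, the two sample variances should land near 1 and 9, respectively, in line with the variance αt quoted in the example.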


In the previous two examples, we have focused on the probability distribution of 
X(t) at a single fixed time point, t. In fact, a random process is characterized by its
simultaneous behavior at all time points. To be precise, a random process X(t) is
characterized only if we know the joint distribution of X(t_1), ..., X(t_r) for all sets of
time points t_1 < ... < t_r and r = 1, 2, 3, 4, .... The “joint behavior” of a random
process, particularly at two points in time, will be explored in depth in the next 
section. 


7.1.3 Exercises: Section 7.1 (1-10) 


1. Classify each of the following processes as discrete-time or continuous-time, 
and discrete-space or continuous-space. 
(a) The temperature in downtown Chicago throughout a day 
(b) The number of customers in line at a certain store throughout the day 
(c) The high temperature in downtown Chicago for each day in a year 
(d) The total number of customers served each day at a certain store 
2. Classify each of the following processes as discrete-time or continuous-time, 
and discrete-space or continuous-space. 
(a) The baud rate of a modem, recorded every 60 s 
(b) The number of people logged into Facebook throughout the day 
(c) The operational state, denoted 1 or 0, of a certain machine recorded at the
end of each hour 
(d) The noise (in dB) in an audio signal measured throughout transmission 

3. For each of the processes in Exercise 1, sketch two possible sample functions.

4. For each of the processes in Exercise 2, sketch two possible sample functions.

5. Consider the server login scenario of Example 7.3. Assuming N(0) = 0, sketch
sample functions for N(t) in each of the following cases:

(a) The login rate exceeds the logout rate. 
(b) The logout rate exceeds the login rate. 
(c) The login and logout rates are equal. 

6. Correlated bit noise. Let X_n be a sequence of random bits (0s and 1s)
constructed as follows: X_0 = 0 or 1 with probability .5 each. For n ≥ 1,
X_n = X_{n−1} with probability .9 and X_n = 1 − X_{n−1} with probability .1.

(a) Write out and sketch two examples of possible sample functions of X_n, for
n = 0, ..., 10.

(b) Which sample function is more likely to be observed: 01100101010, or
00011110000? Explain.

(c) Find the distribution of X_n at time n = 1.

7. Binary phase-shift keying (BPSK) is a simplified version of the QPSK system
described in Example 7.1. One version of the system transmits the bit b, 0 or
1, with the waveform


$$x_b(t) = \cos(2\pi f_0 t + \pi + b\pi), \quad 0 \le t \le T$$


for suitable choices of frequency f_0 and time duration T. For purposes of this
example, assume f_0 = 1 and T = 1.

(a) Sketch the ensemble of this process.

(b) Can the two bits be distinguished at time t = 0.25 s? Why or why not?

(c) Suppose random bit noise with P(0) = .8 and P(1) = .2 is transmitted via
BPSK, and call the resulting random process X(t). Find the probability
distributions of X(0) and of X(.5).

8. Consider a random process X(t) defined by X(t) = A cos(πt) + B sin(πt), where
A and B are iid N(5, 2) rvs.

(a) Graph a sample function of X(t).

(b) Find the probability distributions of X(1/4) and of X(1/2).

(c) Find the joint pdf of X(1/4) and X(1/2). [Hint: Refer back to Sect. 4.7.] 

9. A gambler plays roulette conservatively: she bets on black every time, which 
gives her probability 18/38 of winning on each spin. Define a random sequence 
X_n = the number of wins she has after the nth spin for n = 1, 2, 3, ....

(a) Is X_n a discrete-space or continuous-space sequence?

(b) Sketch two possible sample functions (sequences) for n = 1, ..., 10.

(c) What is the probability distribution of X_n for fixed n?

10. Refer to Example 7.2. Suppose Apple stock has value $580 at time t= 0, that 
the stock’s value increases an average of 25 cents per day, and that the variation 
around that increasing trend can be described by a Brownian motion process 
with parameter α = 20 (see Example 7.6).

(a) Write an expression for X(t), starting at time t = 0, in terms of t and the
process Y(t) from Example 7.6.

(b) Sketch two sample functions of X(t).

(c) Find the probability distribution of Apple’s stock at the end of 1 week of 
trading (t= 5) and at the end of 2 weeks’ trading (t= 10). 


7.2 Properties of the Ensemble: Mean and Autocorrelation 
Functions 


In the previous section, we introduced the notion of a random process X(t). We 
emphasized that, for a fixed time value t, X(t) is a random variable possessing some
probability distribution. Moreover, if we look at two fixed time points t and s, the
two random variables X(t) and X(s) are usually not independent, and we can attempt
to describe their joint probability distribution. In this section, we explore these ideas 
further. 


7.2.1 Mean and Variance Functions


At any particular time t, the random variable X(t) has a probability distribution and
thus has both a mean value and variance. Since X(t) for fixed t is a random variable,
we should be able to calculate its mean using the techniques of Chaps. 2 and 3. Such
a mean value exists for every time t, and the mean might not be the same at every
time t, i.e., the mean of X(t) may vary with t. Then considering all values of t gives a
mean function. Similar comments apply to the variance and standard deviation of X(t).


DEFINITION
The mean function of a random process X(t) is given by


$$\mu_X(t) = E[X(t)],$$


where E[X(t)] is the expected value of the random variable X(t) for the fixed
time point t.
Similarly, we define the variance function of X(t) by


$$\sigma_X^2(t) = \mathrm{Var}(X(t)) = E\big[(X(t) - \mu_X(t))^2\big] = E[X^2(t)] - [\mu_X(t)]^2$$


and the standard deviation function of X(t) by σ_X(t) = √Var(X(t)).


Notice that the mean, variance, and standard deviation functions are nonrandom 
functions of the time variable t, just as the mean, variance, and standard deviation of
a random variable are numbers and not random quantities. It’s vital to keep in mind 
that the mean function of a random process is taking an average with respect to the 
ensemble (i.e., in the “random direction”) and not with respect to time. 


Example 7.7 Reconsider Example 7.5 from Sect. 7.1, where the random process
X(t) was defined by the equation X(t) = v_0 + A cos(ω_0 t + θ_0). To find the mean and
variance functions of X(t), we apply the properties of expected value and variance
established in earlier chapters. Remembering that time t is fixed, the entire term
cos(ω_0 t + θ_0) may be treated as a constant, from which we obtain


$$\mu_X(t) = E[X(t)] = E[v_0 + A\cos(\omega_0 t + \theta_0)] = v_0 + E(A)\cdot\cos(\omega_0 t + \theta_0)$$
$$\sigma_X^2(t) = \mathrm{Var}(X(t)) = \mathrm{Var}(v_0 + A\cos(\omega_0 t + \theta_0)) = \mathrm{Var}(A)\cdot\cos^2(\omega_0 t + \theta_0)$$


Remember that we must square a multiplicative constant for variance.

In the case where A ~ Unif[−1, 1] illustrated in Fig. 7.6, we have E(A) =
(−1 + 1)/2 = 0. Thus the mean function of X(t) is μ_X(t) = v_0 + 0 · cos(ω_0 t + θ_0) = v_0.
We can see this in Fig. 7.6: at any fixed time point t, the average of the values in the
vertical (“random”) direction is clearly zero, the value of v_0 for that graph. If we
imagine vertically averaging these functions, we would arrive at the constant
function μ_X(t) = 0, as claimed.

In this same case, the variance of A is given by Var(A) = (1 − (−1))²/12 = 1/3,
whence


$$\sigma_X^2(t) = \frac{1}{3}\cos^2(\omega_0 t + \theta_0)$$


Thus, the variability of X(t) in the vertical direction increases and decreases in a
periodic manner as we vary t. This, too, can be seen in Fig. 7.6: the vertical spread
varies with t, and this variability is the largest at t = 0, 0.5, 1, and so on, when the
magnitude of the cosine function is maximal. The standard deviation function of X(t) is


$$\sigma_X(t) = \sqrt{\frac{1}{3}\cos^2(\omega_0 t + \theta_0)} = \frac{1}{\sqrt{3}}\,\lvert\cos(\omega_0 t + \theta_0)\rvert$$


The absolute value ensures that our standard deviation is always nonnegative. ■
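
To make the idea of averaging in the "random direction" concrete, the mean and variance functions of Example 7.7 can be estimated by simulating many members of the ensemble at one fixed t. This is a rough sketch under the example's specifications (v_0 = 0, ω_0 = 2π, θ_0 = 0, A ~ Unif[−1, 1]); the function names, sample size, and seed are assumptions made for illustration.

```python
import math
import random
import statistics


def ensemble_mean_and_variance(t, n_paths=50_000, seed=1):
    """Estimate the mean and variance functions of X(t) = A*cos(2*pi*t) at one fixed t.

    The averages are taken across simulated ensemble members, not across time.
    """
    rng = random.Random(seed)
    factor = math.cos(2 * math.pi * t)  # a constant once t is fixed
    values = [rng.uniform(-1.0, 1.0) * factor for _ in range(n_paths)]
    return statistics.fmean(values), statistics.pvariance(values)


def exact_variance(t):
    """Variance function from Example 7.7: (1/3) * cos^2(2*pi*t)."""
    return math.cos(2 * math.pi * t) ** 2 / 3


if __name__ == "__main__":
    for t in (0.0, 1 / 6, 0.25):
        mean_hat, var_hat = ensemble_mean_and_variance(t)
        print(f"t = {t:.3f}: mean ~ {mean_hat:+.4f}, "
              f"variance ~ {var_hat:.4f}, formula = {exact_variance(t):.4f}")
```

The estimated variance should track the formula closely at each t, while the estimated mean stays near 0 throughout.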


Example 7.8 In the previous example, we modeled amplitude variation with a
uniform distribution. However, this is not a realistic model for most observed
amplitude variation. Engineers frequently model amplitude variation A from a
signal with a Rayleigh distribution. One example of a Rayleigh pdf is given by


$$f_A(a) = \begin{cases} a e^{-a^2/2} & a > 0 \\ 0 & \text{otherwise} \end{cases} \tag{7.2}$$


The graph of Eq. (7.2) appears in Fig. 7.7a, illustrating that a small amplitude is
more likely than a large one. Notice that this model only provides positive values
for the amplitude. Figure 7.7b shows the ensemble of X(t) = A cos(2πt) when A has
the pdf specified in Eq. (7.2).


Fig. 7.7 (a) a Rayleigh pdf; (b) the resulting ensemble of X(t) = A cos(2πt)


It’s clear from the graph that the mean function is not zero; rather, it appears to
be itself a sinusoid. (See if you can estimate the amplitude of the mean function by
looking at t = 0 on the graph.) Borrowing from Example 7.7, it’s still true that X(t)
has mean and variance functions given by μ_X(t) = v_0 + E(A)cos(ω_0 t + θ_0) and
σ_X²(t) = Var(A)cos²(ω_0 t + θ_0), respectively. Using calculus, it can be shown the pdf
in Eq. (7.2) has expected value √(π/2) ≈ 1.253; hence, the mean function of the
random process displayed in Fig. 7.7 is μ_X(t) ≈ 1.253 cos(2πt), which is indeed a
sinusoid. ■
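
For readers who want the calculus step behind the value √(π/2), one route is integration by parts (with u = a and dv = a e^{−a²/2} da) followed by the standard Gaussian integral. The derivation below is a sketch supplied for illustration and is not taken from the original example.

```latex
\begin{aligned}
E(A) &= \int_0^\infty a \cdot a e^{-a^2/2}\,da \\
     &= \Big[-a e^{-a^2/2}\Big]_0^\infty + \int_0^\infty e^{-a^2/2}\,da \\
     &= 0 + \tfrac{1}{2}\sqrt{2\pi} \;=\; \sqrt{\pi/2} \;\approx\; 1.253
\end{aligned}
```

The last integral is half of the full Gaussian integral over the real line, which equals √(2π).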
Example 7.9 Signal plus noise. A deterministic (i.e., nonrandom) signal s(t) incurs
noise during transmission, in which case the received message may have the form
Y(t) = s(t) + N(t). The term N(t) is called the “noise component” of the received
signal, and a variety of models can be used to describe such noise. Figure 7.8 below
illustrates (part of) the ensemble of


$$Y(t) = 3\cos(2\pi t + \pi/2) + N(t),$$


where N(t) is Gaussian noise with mean 0 and standard deviation 1 (that is, at each
fixed time point t, N(t) is standard normal).

Let’s first determine the probability distribution of Y(t) at both t = 0.25 and t = 2.
With s(t) = 3cos(2πt + π/2), Y(0.25) = s(0.25) + N(0.25) = −3 + N(0.25). Since
N(0.25) has a Gaussian distribution with mean 0 and variance 1, it follows that
Y(0.25) is also Gaussian, but with a mean of −3 and standard deviation 1. Similarly,
Y(2) = s(2) + N(2) = 0 + N(2) = N(2), so Y(2) is standard normal. We can visualize
both of these distributions by looking vertically in Fig. 7.8.


Fig. 7.8 The ensemble for Y(t)


To find the mean and variance functions of Y(t), note that s(t) is to be treated as a
constant with respect to the ensemble. We find that


$$\mu_Y(t) = E[Y(t)] = E[s(t) + N(t)] = s(t) + E[N(t)] = s(t) + 0 = s(t)$$


That is, the mean function of this random process is just the original signal, s(t);
this is the sinusoid that “carves down the middle” of Fig. 7.8. Finally, since s(t) is an
additive constant in the expression for Y(t), σ_Y²(t) = Var(Y(t)) = Var(s(t) + N(t)) =
Var(N(t)) = 1. The amount of variability around the signal is the same at every point
t in the process.

Notice that the distribution of the noise component, N(t), is N(0, 1) at every
point t. But be careful: saying a process has the same distribution at every point t is
very different from saying N(t) is a constant! ■


Example 7.10 Signal plus noise, round two. Let’s modify the previous example by
specifying that the spread of the noise component N(t) varies with time; specifically,
suppose that N(t) is Gaussian with mean 0 and variance t. The ensemble of
the resulting random process Y(t) appears in Fig. 7.9. The mean function of Y(t) is
still s(t); however, following the derivation in Example 7.9, we find Var(Y(t)) =
Var(N(t)) = t, so σ_Y(t) = √t. ■


Fig. 7.9 The ensemble for Example 7.10
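
The two signal-plus-noise models of Examples 7.9 and 7.10 differ only in how the noise standard deviation depends on t, which a short simulation can make visible. The sketch below is illustrative: it simulates the value of Y(t) at one fixed time only (so it says nothing about how the noise at different times is related), and every name in it is a hypothetical choice.

```python
import math
import random
import statistics


def signal(t):
    """Deterministic signal s(t) = 3*cos(2*pi*t + pi/2) from Example 7.9."""
    return 3 * math.cos(2 * math.pi * t + math.pi / 2)


def constant_noise_sd(t):
    """Example 7.9: noise standard deviation 1 at every time."""
    return 1.0


def growing_noise_sd(t):
    """Example 7.10: noise variance t, so standard deviation sqrt(t)."""
    return math.sqrt(t)


def received_values(t, noise_sd, n_paths=20_000, seed=2):
    """Simulate Y(t) = s(t) + N(t) across n_paths ensemble members at one fixed t."""
    rng = random.Random(seed)
    sd = noise_sd(t)
    return [signal(t) + rng.gauss(0.0, sd) for _ in range(n_paths)]


if __name__ == "__main__":
    for label, sd_fn in (("Example 7.9", constant_noise_sd),
                         ("Example 7.10", growing_noise_sd)):
        for t in (0.25, 4.0):
            y = received_values(t, sd_fn)
            print(f"{label}, t = {t}: mean ~ {statistics.fmean(y):+.3f} "
                  f"(s(t) = {signal(t):+.3f}), variance ~ {statistics.pvariance(y):.3f}")
```

In both models the estimated mean should follow s(t); the estimated variance should stay near 1 in the first model and grow like t in the second.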


7.2.2 Autocovariance and Autocorrelation Functions


The mean and variance describe the distribution of a single random variable. In the 
context of a random process, the mean and variance functions contain information 
about the behavior of the ensemble at each single point in time. But it should be 
clear that for two different times t and s, the random variables X(t) and X(s) will
typically be related. A complete statistical analysis of a random process should 
include an exploration of that relationship. To that end, we now extend the notion of 
covariance from Chap. 4 to a random process. 


DEFINITION
The autocovariance function of a random process X(t) is defined by


$$C_{XX}(t, s) = \mathrm{Cov}(X(t), X(s)) = E[(X(t) - \mu_X(t))(X(s) - \mu_X(s))]$$


Notice that the autocovariance function is a nonrandom function of two
time points, t and s.


We can interpret the autocovariance function of X(t) much as Cov(X, Y) was
interpreted back in Chap. 4. When C_XX(t, s) > 0, above-average values of X(t) tend
to be associated with above-average values of X(s). That is, when X(t) is above its
mean function at time t, it also tends to be above its mean function at time s (and
vice versa). If C_XX(t, s) < 0, then above-average values of the random process at
time t are associated with below-average values at time s (and vice versa).

Properties of the autocovariance function follow directly from the properties 
previously derived for covariance. We provide a partial listing here. 


PROPOSITION

Let C_XX(t, s) denote the autocovariance function of a random process X(t).
1. C_XX(t, s) = C_XX(s, t)

2. C_XX(t, s) = E[X(t)X(s)] − μ_X(t)μ_X(s)

3. σ_X²(t) = Var(X(t)) = Cov(X(t), X(t)) = C_XX(t, t) = E[X²(t)] − μ_X²(t)


In the engineering literature, E[X(t)X(s)] in property 2 is called the autocorrelation
function of X(t) and is denoted R_XX(t, s). Although it will be vital to our study of
signal processing in Chap. 8, don’t confuse this with the correlation coefficient from
Chap. 4; in particular, the sign of R_XX(t, s) does not indicate the direction of the
association between X(t) and X(s), and the magnitude of R_XX(t, s) is not bounded by 1.


Example 7.11 Let’s find the autocovariance function of the random process X(t) =
A cos(2πt) from Examples 7.5 and 7.7. We will illustrate two methods here. Since we
already have the mean function of X(t), we can calculate the autocorrelation function
and then apply property 2 from the preceding proposition:


$$\begin{aligned}
R_{XX}(t,s) &= E[X(t)X(s)] = E[A\cos(2\pi t)\,A\cos(2\pi s)] = E[A^2\cos(2\pi t)\cos(2\pi s)] \\
            &= E(A^2)\cos(2\pi t)\cos(2\pi s) \;\Rightarrow \\
C_{XX}(t,s) &= E[X(t)X(s)] - \mu_X(t)\mu_X(s) \\
            &= E(A^2)\cos(2\pi t)\cos(2\pi s) - E(A)\cos(2\pi t)\cdot E(A)\cos(2\pi s) \\
            &= \big[E(A^2) - (E(A))^2\big]\cos(2\pi t)\cos(2\pi s) \\
            &= \mathrm{Var}(A)\cos(2\pi t)\cos(2\pi s)
\end{aligned}$$


Alternatively, we can manipulate the covariance expression directly by applying
its distributive properties from Chap. 4:


$$\begin{aligned}
C_{XX}(t,s) &= \mathrm{Cov}(X(t), X(s)) = \mathrm{Cov}(A\cos(2\pi t), A\cos(2\pi s)) \\
            &= \mathrm{Cov}(A, A)\cos(2\pi t)\cos(2\pi s) \\
            &= \mathrm{Var}(A)\cos(2\pi t)\cos(2\pi s)
\end{aligned}$$


As a check, substituting s = t gives C_XX(t, t) = Var(A)cos²(2πt), which matches
the expression for σ_X²(t) we found in Example 7.7 (with ω_0 = 2π and θ_0 = 0). ■
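
The closed form in Example 7.11 can be compared against a Monte Carlo estimate of the covariance between X(t) and X(s). The sketch below assumes A ~ Unif[−1, 1], as in Fig. 7.6, so that Var(A) = 1/3; the function names, sample size, and seed are illustrative assumptions.

```python
import math
import random
import statistics


def sample_autocovariance(t, s, n_paths=50_000, seed=3):
    """Estimate Cov(X(t), X(s)) for X(t) = A*cos(2*pi*t), with A ~ Unif[-1, 1].

    The same draw of A is used at both times, because both values belong
    to the same sample function.
    """
    rng = random.Random(seed)
    x_t, x_s = [], []
    for _ in range(n_paths):
        a = rng.uniform(-1.0, 1.0)
        x_t.append(a * math.cos(2 * math.pi * t))
        x_s.append(a * math.cos(2 * math.pi * s))
    mean_t = statistics.fmean(x_t)
    mean_s = statistics.fmean(x_s)
    return statistics.fmean([(u - mean_t) * (v - mean_s) for u, v in zip(x_t, x_s)])


def exact_autocovariance(t, s):
    """Var(A) * cos(2*pi*t) * cos(2*pi*s), with Var(A) = 1/3."""
    return math.cos(2 * math.pi * t) * math.cos(2 * math.pi * s) / 3


if __name__ == "__main__":
    for t, s in ((0.0, 0.5), (0.1, 0.2), (0.0, 0.25)):
        print(f"(t, s) = ({t}, {s}): estimate = {sample_autocovariance(t, s):+.4f}, "
              f"formula = {exact_autocovariance(t, s):+.4f}")
```

The pair (0, 0.5) gives a negative autocovariance: whenever a sample function is above the mean at t = 0, it is below the mean half a period later.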


Example 7.12 Let’s now consider a sinusoid with phase variation, rather than
amplitude variation. Define a random process X(t) by


$$X(t) = A_0\cos(\omega_0 t + \Theta) \tag{7.3}$$


where the phase shift Θ is a rv, uniformly distributed on the interval (−π, π]. The
amplitude A_0 and fundamental frequency ω_0 ≠ 0 are constants. Figure 7.10 shows
several sample functions for this random process with A_0 = 1 and ω_0 = 2π.

Until now, we have managed to compute means and variances without any
calculus; however, because the random process X(t) defined by Eq. (7.3) is a
nonlinear function (cosine) of a random variable, we must rely on calculus here.
Specifically, we apply the formula for the expectation of a function of a continuous
random variable, presented in Chap. 3:


$$\begin{aligned}
\mu_X(t) &= E[X(t)] = E[A_0\cos(\omega_0 t + \Theta)] \\
         &= \int_{-\pi}^{\pi} A_0\cos(\omega_0 t + \theta) f_\Theta(\theta)\,d\theta
          = \int_{-\pi}^{\pi} A_0\cos(\omega_0 t + \theta)\,\frac{1}{2\pi}\,d\theta \\
         &= \frac{A_0}{2\pi}\int_{-\pi}^{\pi}\cos(\omega_0 t + \theta)\,d\theta = \frac{A_0}{2\pi}(0) = 0
\end{aligned}$$


The last integral equals zero because, as a function of θ, it represents the
integration of a cosine through one period. A mean function identically equal to
zero coincides with what we see in Fig. 7.10.

Since the mean function is zero, the autocovariance and autocorrelation
functions will be identical. Calculation of these functions requires a trig identity:


$$\begin{aligned}
C_{XX}(t,s) &= E[X(t)X(s)] - \mu_X(t)\mu_X(s) = E[X(t)X(s)] \\
  &= E[A_0\cos(\omega_0 t + \Theta)\cdot A_0\cos(\omega_0 s + \Theta)] \\
  &= A_0^2\,E[\cos(\omega_0 t + \Theta)\cos(\omega_0 s + \Theta)] \\
  &= A_0^2\,E\Big[\tfrac{1}{2}\big\{\cos(\omega_0 t + \Theta + \omega_0 s + \Theta) + \cos(\omega_0 t + \Theta - [\omega_0 s + \Theta])\big\}\Big] \\
  &= \frac{A_0^2}{2}\,E[\cos(\omega_0 t + \omega_0 s + 2\Theta) + \cos(\omega_0 t - \omega_0 s)] \\
  &= \frac{A_0^2}{2}\,E[\cos(\omega_0 t + \omega_0 s + 2\Theta)] + \frac{A_0^2}{2}\cos(\omega_0 t - \omega_0 s) \\
  &= \frac{A_0^2}{2}\cdot 0 + \frac{A_0^2}{2}\cos(\omega_0 t - \omega_0 s) = \frac{A_0^2}{2}\cos(\omega_0 t - \omega_0 s)
\end{aligned}$$


The last expected value equals zero because it represents an integral of a cosine
through two periods. Finally, the variance function is given by


$$\sigma_X^2(t) = C_{XX}(t,t) = \frac{A_0^2}{2}\cos(\omega_0 t - \omega_0 t) = \frac{A_0^2}{2}\cos(0) = \frac{A_0^2}{2}$$


Notice that the variance function of X(t) is a constant (same spread for all t),
which agrees with Fig. 7.10. ■
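
A simulation offers a quick sanity check on the result of Example 7.12, in particular on the fact that the autocovariance depends on t and s only through their difference. The following sketch is illustrative, with hypothetical names and default parameters A_0 = 1 and ω_0 = 2π matching Fig. 7.10.

```python
import math
import random
import statistics


def phase_autocovariance(t, s, a0=1.0, omega0=2 * math.pi, n_paths=50_000, seed=4):
    """Estimate E[X(t)X(s)] for X(t) = a0*cos(omega0*t + Theta), Theta ~ Unif(-pi, pi].

    Because the mean function is identically zero, this expectation is
    both the autocorrelation and the autocovariance.
    """
    rng = random.Random(seed)
    products = []
    for _ in range(n_paths):
        theta = rng.uniform(-math.pi, math.pi)  # one phase per sample function
        products.append(a0 * math.cos(omega0 * t + theta)
                        * a0 * math.cos(omega0 * s + theta))
    return statistics.fmean(products)


def exact_phase_autocovariance(t, s, a0=1.0, omega0=2 * math.pi):
    """(a0^2 / 2) * cos(omega0*t - omega0*s), the closed form from Example 7.12."""
    return a0 ** 2 / 2 * math.cos(omega0 * t - omega0 * s)


if __name__ == "__main__":
    # Two pairs of times with the same separation of 0.2 time units.
    for t, s in ((0.1, 0.3), (5.1, 5.3)):
        print(f"(t, s) = ({t}, {s}): estimate = {phase_autocovariance(t, s):+.4f}, "
              f"formula = {exact_phase_autocovariance(t, s):+.4f}")
```

Both pairs of times should give essentially the same estimate, and setting s = t should return a value near A_0²/2.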


7.2.3 The Joint Distribution of Two Random Processes


Some applications involve the consideration of two random processes X(t) and Y(t).
We may then be concerned not only with their individual distributions but also their
joint behavior. This is especially true when Y(t) is the result of some action taken on
X(t), such as passing the random signal X(t) through an appropriate filter. To
quantify their relationship, we define the cross-covariance function of X(t) with
Y(t) by C_XY(t, s) = Cov(X(t), Y(s)) and the cross-correlation function of X(t) with
Y(t) by R_XY(t, s) = E[X(t)Y(s)]. These two functions are, not surprisingly, connected
by the formula C_XY(t, s) = R_XY(t, s) − μ_X(t)μ_Y(s).


DEFINITION 

Two random processes X(t) and Y(t) are independent if, for all fixed t and s,
the random variables X(t) and Y(s) are independent rvs as defined in Chap. 4.
X(t) and Y(t) are uncorrelated if, for all t and s, C_XY(t, s) = 0. Finally, X(t) and
Y(t) are orthogonal if R_XY(t, s) = 0 for all t and s.


Notice in these definitions that properties must hold for all times ¢ and s. 
For example, the independence of X(t) and Y(t) requires that the random variables
X(2) and Y(10) be independent, as must X(2) and Y(2) be, and so on. A similar 
comment applies to being uncorrelated or orthogonal. 

As in Chap. 4, independence is a stronger condition than zero correlation: 


X(t) and Y(t) independent => X(t) and Y(t) uncorrelated, 


but the converse is false. If X(t) and Y(t) are uncorrelated, it follows from the
definition of covariance that E[X(t)Y(s)] = E[X(t)]E[Y(s)] for all t and s. Thus, being
uncorrelated does not imply being orthogonal (nor vice versa); however, if either
random process has mean identically equal to zero, then the properties of being 
uncorrelated and orthogonal are equivalent. 
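
The relationship among these three properties can be read off from the connecting formula between cross-covariance and cross-correlation. The display below is a short restatement offered as a reading aid, using the notation C_XY, R_XY, μ_X, μ_Y defined earlier in this subsection.

```latex
\begin{aligned}
&C_{XY}(t,s) = R_{XY}(t,s) - \mu_X(t)\,\mu_Y(s) \\[4pt]
&\text{uncorrelated:}\quad C_{XY}(t,s) = 0 \iff R_{XY}(t,s) = \mu_X(t)\,\mu_Y(s) \ \text{for all } t, s \\
&\text{orthogonal:}\quad\;\; R_{XY}(t,s) = 0 \iff C_{XY}(t,s) = -\mu_X(t)\,\mu_Y(s) \ \text{for all } t, s \\[4pt]
&\mu_X(t) \equiv 0 \ \text{ or } \ \mu_Y(s) \equiv 0 \ \implies\ C_{XY}(t,s) = R_{XY}(t,s) \ \text{for all } t, s
\end{aligned}
```

When one mean function vanishes identically, the two functions coincide, so one is zero everywhere exactly when the other is.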


7.2.4 Exercises: Section 7.2 (11-24) 


11. Consider the QPSK system described in Example 7.1 as a model for random 
noise. Suppose the four possible symbols to be transmitted are equally likely to 
occur, i.e., we have a random process 


$$X(t) = \cos(2\pi f_0 t + \pi/4 + K\pi/2)$$


where K is 0, 1, 2, or 3 with probability .25 each.
(a) Find the mean function of X(t). Simplify as much as possible.
(b) Find the variance function of X(t). Are your answers consistent with
Fig. 7.2? 

12. Show that C_XX(t, s) = E[X(t)X(s)] − μ_X(t)μ_X(s).

13. Consider the random process X(t) = v_0 + A cos(ω_0 t + θ_0) from Example 7.7. Find
the autocovariance function and autocorrelation function of X(t).

14. Let X(t) = At + B, where A and B are independent random variables with
A ~ Unif[0, 6] and B ~ Unif[−10, 10].

(a) Describe the ensemble of X(t).

(b) Determine the mean function of X(t).

(c) Determine the autocovariance function of X(t).

(d) Determine the autocorrelation function of X(t).

(e) Determine the variance function of X(t).

15. Let N(t) be a Gaussian noise process as in Example 7.9, with mean 0 for all t
and autocovariance function C_NN(t, s) = e^(−|s − t|).

(a) Verify that N(t) has variance 1 for all t.

(b) If N(t) > 0, would you predict that N(s) > 0 or N(s) < 0? Explain.

(c) Determine the correlation coefficient ρ of N(10) and N(12).

(d) Determine the probability distribution of N(12) − N(10).

16. Let N(t) be the Gaussian noise process of Example 7.10, with mean function
0 and variance function t.

(a) Calculate P(N(1) > .5) and P(N(4) > .5).

(b) Could the autocovariance function of N(t) be e^(−|s − t|)? Why or why not?

(c) Suppose the autocovariance function of N(t) is min(t, s), i.e., C_NN(t, s) = t for
t ≤ s and s for t > s. Find the correlation coefficient between N(t) and N(s).
[Hint: consider the two cases t ≤ s and t > s.]

(d) Determine the probability distribution of N(s) − N(t).

17. Consider the phase-variation random process (7.3) with A_0 = 1 and ω_0 = 2π.

(a) Use the results of Example 7.12 to show that, for fixed t, X(t) does not have
a uniform distribution. [Hint: What is the interval of possible values for
X(t)? If X(t) were uniform, what would its variance be?]

(b) Use the transformation method of Sect. 3.7 to show that the rv Y = X(0) has
an arcsine distribution:


$$f_Y(y) = \frac{1}{\pi\sqrt{1 - y^2}}, \quad -1 < y < 1$$


[Note: It can be shown that X(t) has this same distribution for all t.]

18. Let A(t) be a random process, and define an “amplitude modulated” version of
A(t) by X(t) = A(t)cos(ω_0 t + Θ), where Θ ~ Unif(−π, π] and is independent of
A(t), and ω_0 is a constant.

(a) Determine the mean function of X(t).

(b) Determine the autocorrelation function of X(t).

(c) Determine the cross-correlation of A(t) and X(t).
[Hint: Use the results of Example 7.12.]

19. Consider a “signal plus noise” process where both components are random:
X(t) = S(t) + N(t). Assume S(t) and N(t) are uncorrelated random processes.
Determine each of the following functions in terms of the mean, autocorrelation,
etc. of S(t) and N(t).

(a) The mean function of X(t).

(b) The autocorrelation function of X(t).

(c) The autocovariance function of X(t).

(d) The variance function of X(t).

20. Consider the random process X(t) = S(t) + N(t) from the previous exercise. Find 
the cross-correlation between the signal component S(t) and the overall process 
X(t). 

21. Consider two random processes X(t) and Y(t). 

(a) Show that if X(t) and Y(t) are uncorrelated random processes, then
E[X(t)Y(s)] = μ_X(t)μ_Y(s).

(b) Show that if X(t) and Y(t) are uncorrelated random processes and X(t) has
mean function equal to zero, then X(t) and Y(t) are orthogonal.

22. Let A(t) and B(t) be iid processes, i.e., A(t) and B(t) are independent processes
with the same mean function μ(t), autocovariance function C(t, s), etc. Define a
pair of new random processes by


$$X(t) = A(t) + B(t)$$
$$Y(t) = A(t) - B(t)$$


(a) Find the mean functions of X(t) and Y(t).
(b) Find the autocovariance functions of X(t) and Y(t).
(c) Find the cross-covariance function C_XY(t, s).
23. Let Θ be a uniformly distributed rv on (−π, π]. Define two random processes
X(t) = cos(ω_0 t + Θ) and Y(t) = sin(ω_0 t + Θ).
(a) Find the cross-correlation and cross-covariance of X(t) and Y(t).
(b) Are X(t) and Y(t) orthogonal random processes? Uncorrelated random
processes? Independent random processes?
24. Let R_XY(t, s) be the cross-correlation function of X(t) with Y(t), and define the
cross-correlation of Y(t) with X(t) by R_YX(t, s). Show that R_YX(t, s) = R_XY(s, t).
Show that a similar relationship holds for cross-covariance.


7.3 Stationary and Wide-Sense Stationary Processes


When modeling certain random processes, particularly those representing noise, it 
facilitates the analysis if the statistical properties of the process remain the same 
across time. This turns out to be true for some, though certainly not all, models. We 
will make this notion more precise shortly, but first let’s revisit three of the 
examples from the previous section. The relevant graphs are presented in 
Fig. 7.11 below. Figure 7.11a shows the ensemble of the phase-variation random
process from Example 7.12. Notice that the probability distribution of X(t)
(remember, that’s the distribution in the vertical direction) appears to be the same
at each time point t. Figure 7.11b shows just the noise component, N(t), from
Example 7.9. Again, we see roughly the same ensemble behavior at every time t,
suggesting the process’ statistical properties do not change over time. In contrast,
consider the noise component N(t) from Example 7.10, displayed here in Fig. 7.11c.
While the mean of N(t) is constant, its variance clearly increases with t; this model
does not possess the property of interest.


Fig. 7.11 Three ensembles: (a) the phase-variation process from Example 7.12; (b) the noise
component from Example 7.9; (c) the noise component from Example 7.10

We now formalize the notion of stable behavior over time. 


DEFINITION 

A random process X(t) is (strict-sense) stationary if all of its statistical
properties are invariant with respect to time. More precisely, X(t) is stationary
if, for any time points t_1, ..., t_r and any value τ, the joint distribution of X(t_1),
..., X(t_r) is the same as the joint distribution of X(t_1 + τ), ..., X(t_r + τ).


This definition requires that the statistical behavior of X(t) remain the same if we
“translate” the random process τ time units. In particular, it requires that X(t_1) and
X(t_1 + τ) have the same distribution for all t_1 and τ; it follows that X(t) must have the
same mean, standard deviation, etc. at all times t. This corresponds to what we see
in Fig. 7.11a, b, but not c. Notice, however, that the definition requires more: since
the joint distribution of X(t_1) and X(t_2) must be translation-invariant, it follows that
the autocovariance function of X(t) must be translation-invariant as well. We
certainly cannot determine this from a visual inspection of the ensemble.

In fact, it is rarely practical to determine whether a particular random process 
model is strict-sense stationary, since it requires an unlimited number of 
comparisons (joint distributions of r variables at all time-points and for all possible 
r). Fortunately, a weaker version of stationarity suffices for the purposes of many 
analyses. 


DEFINITION 

A random process X(t) is wide-sense stationary (WSS) if the following two
conditions hold:

1. The mean function of X(t), μ_X(t), is a constant.

2. The autocovariance function of X(t), C_XX(t, s), depends only on s − t.


We interpret condition 2 as follows: the degree of association between X(t) and
X(s), as measured by covariance, depends on how far apart the two times s and t are,
but not where those times are located on an absolute scale. So, for example, the
covariance between X(3) and X(10) is the same as the covariance between
X(23) and X(30) when condition 2 is satisfied (since, in both cases, s − t = 7).

Condition 2 of this definition can be stated more cleanly if we re-parameterize
the second time variable. Let’s write s = t + τ, so that τ represents the difference
between the two times s and t. Then wide-sense stationarity requires that the
autocovariance function C_XX(t, t + τ) depend only on τ (and not on t). In fact, with
this notation, we can define a wide-sense stationary process to be one such that both
μ_X(t) and C_XX(t, t + τ) are independent of t.

Before looking at some examples, we note that the defining conditions can be
restated in terms of the autocorrelation function, R_XX: a random process X(t) is WSS
iff (1) μ_X(t) is a constant and (2) R_XX(t, t + τ) depends only on τ.


Example 7.13 Is the amplitude-variation random process, X(t) = A cos(2πt), wide-sense
stationary? The graphs in Figs. 7.6 and 7.7 clearly indicate not. Indeed, in
Example 7.11 we found the autocovariance of this random process to be C_XX(t, s) =
Var(A)cos(2πt)cos(2πs), which depends separately on t and s, not just their
difference. Therefore, the amplitude-variation random process is not WSS. ■


Example 7.14 Is the phase-variation random process X(t) = A_0 cos(ω_0 t + Θ) from
Example 7.12 wide-sense stationary? Using the results of Example 7.12, we can
check the two required conditions:

1. μ_X(t) = 0, a constant. Thus the first condition is satisfied.

2. C_XX(t, s) = (A_0²/2) cos(ω_0 t − ω_0 s), so

\[
C_{XX}(t, t+\tau) = \frac{A_0^2}{2}\cos(\omega_0 t - \omega_0(t+\tau))
= \frac{A_0^2}{2}\cos(-\omega_0\tau) = \frac{A_0^2}{2}\cos(\omega_0\tau).
\]

Since C_XX(t, t + τ) depends only on τ and not on t, the second condition is met.
Therefore, X(t) is indeed wide-sense stationary. ■
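The two conditions checked in Example 7.14 can also be examined numerically. The sketch below is only an illustration, not part of the original example: it approximates the ensemble average over Θ by averaging over an evenly spaced grid of phases, and the function name `ensemble_moments` and the values chosen for A_0, ω_0, and τ are assumptions made for the demonstration.

```python
import math

def ensemble_moments(a0, w0, t, s, n_phases=1000):
    """Approximate E[X(t)] and Cov(X(t), X(s)) for X(t) = a0*cos(w0*t + Theta),
    Theta ~ Unif(-pi, pi], by averaging over an evenly spaced grid of phases."""
    thetas = [-math.pi + 2 * math.pi * (k + 0.5) / n_phases for k in range(n_phases)]
    xt = [a0 * math.cos(w0 * t + th) for th in thetas]
    xs = [a0 * math.cos(w0 * s + th) for th in thetas]
    mean_t = sum(xt) / n_phases
    mean_s = sum(xs) / n_phases
    cov = sum(a * b for a, b in zip(xt, xs)) / n_phases - mean_t * mean_s
    return mean_t, cov

# Illustrative values (assumed, not taken from the text).
a0, w0, tau = 2.0, 1.5, 0.4
theory = a0 ** 2 / 2 * math.cos(w0 * tau)   # (A_0^2 / 2) cos(w0 * tau)

# The same lag tau is examined at several different starting times t.
for t in (0.0, 3.0, 10.0):
    mean_t, cov = ensemble_moments(a0, w0, t, t + tau)
    print(t, round(mean_t, 6), round(cov, 6), round(theory, 6))
```

For every starting time t, the computed mean should be essentially 0 and the computed covariance should agree with (A_0²/2)cos(ω_0 τ), which is what wide-sense stationarity requires.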


Example 7.15 Let A and B be iid mean-zero random variables, and define a
random process by

\[
X(t) = A\cos(\omega_0 t) + B\sin(\omega_0 t) \tag{7.4}
\]

for some frequency ω_0. Is X(t) wide-sense stationary? The mean function is

\[
\begin{aligned}
\mu_X(t) &= E[A\cos(\omega_0 t) + B\sin(\omega_0 t)] = E[A]\cos(\omega_0 t) + E[B]\sin(\omega_0 t) \\
&= 0\cos(\omega_0 t) + 0\sin(\omega_0 t) = 0
\end{aligned}
\]

Since the mean of X(t) is a constant, the first condition is met. Next, let’s
consider the autocovariance function. Using the distributive properties of
covariance,

\[
\begin{aligned}
C_{XX}(t,s) &= \mathrm{Cov}(A\cos(\omega_0 t) + B\sin(\omega_0 t),\ A\cos(\omega_0 s) + B\sin(\omega_0 s)) \\
&= \mathrm{Cov}(A\cos(\omega_0 t), A\cos(\omega_0 s)) + \mathrm{Cov}(A\cos(\omega_0 t), B\sin(\omega_0 s)) \\
&\quad + \mathrm{Cov}(B\sin(\omega_0 t), A\cos(\omega_0 s)) + \mathrm{Cov}(B\sin(\omega_0 t), B\sin(\omega_0 s)) \\
&= \mathrm{Cov}(A,A)\cos(\omega_0 t)\cos(\omega_0 s) + \mathrm{Cov}(A,B)\cos(\omega_0 t)\sin(\omega_0 s) \\
&\quad + \mathrm{Cov}(B,A)\sin(\omega_0 t)\cos(\omega_0 s) + \mathrm{Cov}(B,B)\sin(\omega_0 t)\sin(\omega_0 s)
\end{aligned}
\]

Since A and B are independent, Cov(A, B) = Cov(B, A) = 0; since they’re identically
distributed, Cov(A, A) = Cov(B, B) = σ², the common variance of A and B.
Using a trig identity, we arrive at

\[
\begin{aligned}
C_{XX}(t,s) &= \sigma^2\cos(\omega_0 t)\cos(\omega_0 s) + \sigma^2\sin(\omega_0 t)\sin(\omega_0 s) = \sigma^2\cos(\omega_0 t - \omega_0 s) \\
&= \sigma^2\cos(\omega_0[t - s])
\end{aligned}
\]

Since this depends only on the difference in the two times t and s, the second
condition is met. (In fact, writing s = t + τ, we may simplify the autocovariance
expression further, to σ² cos(ω_0[t − (t + τ)]) = σ² cos(−ω_0 τ) = σ² cos(ω_0 τ).)
Therefore, yes, X(t) in Expression (7.4) is wide-sense stationary. ■


Example 7.16 (Example 7.15 continued) In the previous example, we proved that
any random process of the form (7.4) is WSS, provided A and B are iid with
mean zero.

As a curious example, suppose A and B are independent and each is equally
likely to be +1 or −1; this particular distribution has mean 0 and variance 1. Since
A and B are iid with mean 0, the random process X(t) in Eq. (7.4) is WSS, with mean
0, autocovariance function C_XX(t, t + τ) = σ² cos(ω_0 τ) = cos(ω_0 τ), and thus variance
σ_X²(t) = C_XX(t, t) = 1. However, since A and B each can only take on the two values
±1, the entire ensemble of X(t) consists of just four functions:

\[
X(t) = \pm\cos(\omega_0 t) \pm \sin(\omega_0 t)
\]

This ensemble appears in Fig. 7.12; its appearance does not match earlier
pictures of WSS processes. But it is nonetheless true that the mean of the vertical
coordinates at any time-point t is 0, the variance is 1, and that the covariance
between any two time points τ units apart is cos(ω_0 τ).


Fig. 7.12 The ensemble of X(t) in Example 7.16


In part, the lesson here is that we cannot rely on a visualization of a random
process to determine whether it’s wide-sense stationary; despite appearances, this
really is a WSS process. That said, it’s clear that the probability distribution of X(t)
is not the same for all t, e.g., the possible values of the process are {−1, 1} at
some t-coordinates (such as t = 0) and {−√2, 0, √2} at others (such as ω_0 t = π/4).
So, while this random process is wide-sense stationary, it is certainly not
strict-sense stationary. ■
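Because the ensemble in Example 7.16 contains only four equally likely sample functions, its moments can be computed exactly by enumeration. The following sketch does this; the helper names and the choice ω_0 = 1 are assumptions made purely for illustration.

```python
import math
from itertools import product

def ensemble_values(w0, t):
    """The four equally likely values of X(t) = A*cos(w0*t) + B*sin(w0*t), A, B in {-1, +1}."""
    return [a * math.cos(w0 * t) + b * math.sin(w0 * t)
            for a, b in product((-1, 1), repeat=2)]

def exact_mean(w0, t):
    return sum(ensemble_values(w0, t)) / 4

def exact_cov(w0, t, s):
    # Both lists enumerate (A, B) in the same order, so zip pairs X(t) and X(s)
    # from the same sample function.
    xt, xs = ensemble_values(w0, t), ensemble_values(w0, s)
    return sum(x * y for x, y in zip(xt, xs)) / 4 - exact_mean(w0, t) * exact_mean(w0, s)

def support(w0, t):
    """Distinct possible values of X(t), rounded to suppress floating-point noise."""
    return sorted({round(v, 9) + 0.0 for v in ensemble_values(w0, t)})

w0, tau = 1.0, 0.7   # illustrative values
for t in (0.0, math.pi / 4, 2.0):
    print(round(exact_mean(w0, t), 9), round(exact_cov(w0, t, t + tau), 9), support(w0, t))
print("cos(w0 * tau) =", round(math.cos(w0 * tau), 9))
```

The mean and the lag-τ covariance come out the same at every t, while the set of possible values changes with t. This is the distinction between wide-sense and strict-sense stationarity that the example is making.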


In Chap. 8, we will study relationships between the behavior of the input X(t) to a
filter and the resulting output Y(t); we will often require that X(t) be WSS. Two
random processes X(t) and Y(t) are called jointly wide-sense stationary if (1) X(t)
is WSS, (2) Y(t) is WSS, and (3) the cross-covariance function C_XY(t, t + τ) does not
depend on t. (Equivalently, X(t) and Y(t) are jointly WSS if they are both WSS
processes and R_XY(t, t + τ) is independent of t.)


7.3.1 Properties of Wide-Sense Stationary Processes


By definition, a WSS random process has a constant mean (function). In fact, such a
process also has constant variance, because σ_X²(t) = C_XX(t, t) from a proposition in
the previous section and wide-sense stationarity requires that this covariance not
depend on t. With that and our previous discussions in mind, we adopt the following
notational conventions for WSS processes.


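NOTATION
Suppose X(t) is a wide-sense stationary process. Then we denote its statistical
functions as follows:

\[
\begin{aligned}
\mu_X &= E[X(t)] \\
\sigma_X^2 &= \mathrm{Var}(X(t)) \\
C_{XX}(\tau) &= \mathrm{Cov}(X(t), X(t+\tau)) = R_{XX}(\tau) - \mu_X^2 \\
R_{XX}(\tau) &= E[X(t)X(t+\tau)] = C_{XX}(\tau) + \mu_X^2
\end{aligned}
\]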


Next we present some important properties of these functions. 


PROPOSITION
Let X(t) be a wide-sense stationary process with autocovariance function
C_XX(τ) and autocorrelation function R_XX(τ).

Properties of C_XX(τ):

1. C_XX(0) = E[X²(t)] − μ_X² = σ_X² = Var(X(t)).

2. C_XX(−τ) = C_XX(τ); that is, the autocovariance function is symmetric in τ.

3. |C_XX(τ)| ≤ C_XX(0) for every τ; that is, the autocovariance function achieves
its largest value at τ = 0.

4. If X(t) is periodic, so is C_XX(τ), and with the same period.

5. If X(t) is ergodic¹ and has no periodic component, then C_XX(τ) → 0 as
|τ| → ∞.

Properties of R_XX(τ):

1. R_XX(0) = E[X²(t)], called the mean square value of X(t).

2. R_XX(−τ) = R_XX(τ); that is, the autocorrelation function is symmetric in τ.

3. |R_XX(τ)| ≤ R_XX(0) for every τ; that is, the autocorrelation function achieves
its largest value at τ = 0.

4. If X(t) is periodic, so is R_XX(τ), and with the same period.

5. If X(t) is ergodic and has no periodic component, then R_XX(τ) → μ_X² as
|τ| → ∞.


¹ Loosely speaking, a random process is ergodic if its time and ensemble properties “match.” We
will define ergodicity more carefully later in this section; for now, you may assume the processes
referenced in this section are ergodic unless noted otherwise.


Proof We begin with the properties of C_XX(τ). Property 1 follows from the
covariance shortcut formula. To prove Property 2, recall from Chap. 4 that
covariance is symmetric in its arguments:

\[
C_{XX}(\tau) = \mathrm{Cov}(X(t), X(t+\tau)) = \mathrm{Cov}(X(t+\tau), X(t))
\]

Because X(t) is WSS, the right-most expression depends only on the difference
in the two times; specifically, Cov(X(t + τ), X(t)) = C_XX(t − [t + τ]) = C_XX(−τ). This
establishes the result.

Property 3 is left as an exercise (see Exercise 38). We note that Property 3 makes
intuitive sense: since covariance measures the association between two variables,
and τ represents the time distance between these two variables, covariance should
be largest when that time difference is as small as possible. That is, the behaviors of
X(t) and X(t + τ) should be more closely related when τ is small than when τ is large.

Properties 1-3 for the autocorrelation function follow automatically, since
R_XX(τ) and C_XX(τ) only differ by the constant μ_X².

Toward proving Property 4, suppose X(t) is periodic with period d, so X(t) =
X(t + d) for all t. Then, for any τ,

\[
\begin{aligned}
R_{XX}(\tau + d) &= E[X(t)X(t+\tau+d)] \\
&= E[X(t)X(t+\tau)] \quad \text{because } X(t+\tau+d) = X(t+\tau) \\
&= R_{XX}(\tau)
\end{aligned}
\]

which shows that R_XX(τ) is also periodic with period d. The analogous property
holds for autocovariance, because subtracting μ_X² to get C_XX from R_XX does not
affect periodicity.

A formal proof of Property 5 is beyond the scope of this book; however, the
paragraph above regarding Property 3 should give some intuition for why
covariance should vanish as |τ| → ∞. Some further information about “ergodicity”
appears at the end of this section. ■


It’s important to note that while every autocovariance and autocorrelation
function of a WSS process satisfies the properties listed in this proposition, these
properties do not completely characterize such functions. That is to say, there exist
functions that satisfy properties 1-5 but are not valid autocovariance/autocorrelation
functions. We’ll explore this further in Chap. 8, when we connect autocorrelation
and autocovariance functions to the power spectrum of a random signal. (For a
preview, see Exercise 40.)


Example 7.17 Suppose X(t) is a wide-sense stationary random process with
autocorrelation function

\[
R_{XX}(\tau) = 100 + \frac{16}{1+\tau^2}
\]

Let’s determine as much as we can about the other statistical properties of X(t).
First, the mean square value of X(t) is R_XX(0) = 100 + 16 = 116 (Property 1). Next,
X(t) clearly has no periodic component; otherwise, R_XX(τ) would also (Property 4).
Thus we may apply Property 5:

\[
\mu_X^2 = \lim_{|\tau|\to\infty} R_{XX}(\tau) = \lim_{|\tau|\to\infty}\left[100 + \frac{16}{1+\tau^2}\right] = 100 + 0 = 100
\]

from which μ_X either equals +10 or −10; notice that we cannot determine which is
correct from R_XX(τ). We can, however, determine the autocovariance function:

\[
C_{XX}(\tau) = R_{XX}(\tau) - \mu_X^2 = 100 + \frac{16}{1+\tau^2} - 100 = \frac{16}{1+\tau^2}
\]

Notice this autocovariance function goes to 0 as |τ| → ∞, as guaranteed by
Property 5. Finally, the variance of this random process is given by
σ_X² = C_XX(0) = 16, and the standard deviation is σ_X = 4. ■
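The arithmetic of Example 7.17 can be traced in a few lines of code. This is only a sketch: it assumes the autocorrelation function R_XX(τ) = 100 + 16/(1 + τ²), consistent with the values 116 and 100 computed in the example, and it uses a very large lag as a numerical stand-in for the limit in Property 5.

```python
def r_xx(tau):
    """Autocorrelation function assumed for Example 7.17."""
    return 100 + 16 / (1 + tau ** 2)

mean_square = r_xx(0)               # Property 1: R_XX(0) = E[X^2(t)]
mu_squared = round(r_xx(1e9), 6)    # stand-in for the limit as |tau| -> infinity

def c_xx(tau):
    """Autocovariance: autocorrelation minus the squared mean."""
    return r_xx(tau) - mu_squared

variance = c_xx(0)
std_dev = variance ** 0.5
print(mean_square, mu_squared, variance, std_dev)
```

The sign of the mean cannot be recovered this way; only its square is available from the autocorrelation function.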


Example 7.18 Partitioning a random process. Suppose X_1(t) and X_2(t) are
independent, zero-mean, WSS random processes with autocorrelation functions
R_11(τ) = 2000tri(10,000τ) and R_22(τ) = 650cos(40,000πτ), respectively.²

² Readers not familiar with the triangular or “tri” function should consult Appendix B.

Define a new random process by X(t) = X_1(t) + X_2(t) + 40. The mean function of
X(t) is

\[
\begin{aligned}
\mu_X(t) = E[X(t)] &= E[X_1(t) + X_2(t) + 40] = E[X_1(t)] + E[X_2(t)] + 40 \\
&= 0 + 0 + 40 = 40
\end{aligned}
\]

Determining the autocorrelation function requires some significant algebraic
work:

\[
\begin{aligned}
R_{XX}(t,s) = E[X(t)X(s)] &= E[(X_1(t) + X_2(t) + 40)(X_1(s) + X_2(s) + 40)] \\
&= E[X_1(t)X_1(s)] + E[X_1(t)X_2(s)] + E[40X_1(t)] + E[X_2(t)X_1(s)] \\
&\quad + E[X_2(t)X_2(s)] + E[40X_2(t)] + E[40X_1(s)] + E[40X_2(s)] + E[1600]
\end{aligned}
\tag{7.5}
\]

Four of the terms in Expression (7.5) may be simplified by removing the constant
40, e.g., E[40X_1(t)] = 40E[X_1(t)] = 40(0) = 0, since X_1(t) is a mean-zero process. The
other three similar terms are also 0. Using the independence assumption, we can
rewrite the second term in Eq. (7.5) as E[X_1(t)X_2(s)] = E[X_1(t)]E[X_2(s)] = (0)(0) = 0.
The last term in the middle line of Eq. (7.5) is 0 for the same reason. In fact, only
three terms do not vanish:

\[
\begin{aligned}
R_{XX}(t,s) &= E[X_1(t)X_1(s)] + E[X_2(t)X_2(s)] + E[1600] \\
&= R_{11}(t,s) + R_{22}(t,s) + 1600 \\
&= R_{11}(\tau) + R_{22}(\tau) + 1600 \quad \text{because } X_1(t) \text{ and } X_2(t) \text{ are WSS} \\
&= 2000\,\mathrm{tri}(10{,}000\tau) + 650\cos(40{,}000\pi\tau) + 1600 \quad (\tau = s - t)
\end{aligned}
\]


That is, the autocorrelation function of X(t) equals the sum of the
autocorrelations of X_1(t) and X_2(t), plus the square of the constant term
(40² = 1600). Since the mean of X(t) is a constant, 40, and the autocorrelation
function of X(t) depends only on τ, the random process X(t) is indeed wide-sense
stationary.

Finally, we can easily find the autocovariance function (which, since X(t) is
WSS, only depends on τ):

\[
C_{XX}(\tau) = R_{XX}(\tau) - \mu_X^2 = R_{XX}(\tau) - 40^2 = 2000\,\mathrm{tri}(10{,}000\tau) + 650\cos(40{,}000\pi\tau)
\]

Notice that, since X_1(t) and X_2(t) have mean zero, the two terms in C_XX(τ) are, in
fact, their respective autocovariance functions.

Let’s examine this example further. The random process X(t) consists of three
components: X_1(t), which is not periodic (since R_11(τ) isn’t); X_2(t), which is
periodic; and a constant. It’s often the case that we can partition a random process
in this manner:

\[
X(t) = \{\text{aperiodic components}\} + \{\text{periodic components}\} + \{\text{constant}\}
\]

Any or all of these three elements may be present, and the first two components
may themselves be sums of other parts, e.g., the sum of several sinusoids with
different periods may comprise the “periodic components” piece. In engineering
language, the constant term is called the dc offset: if X(t) represents a current
waveform, the constant term is the direct current in X(t), while the other two
components comprise the alternating current (ac) of X(t).

Now look at R_XX(τ), which also consists of three parts: an aperiodic part (also
called the dissipative component, since this is the term that goes to 0 as |τ| → ∞), a
periodic part, and the square of the dc offset. In general, the autocorrelation
function of a WSS process can be decomposed into

\[
R_{XX}(\tau) = \{\text{dissipative components}\} + \{\text{periodic components}\} + \{\text{constant}\}
\tag{7.6}
\]

The constant term in Eq. (7.6) is called the dc power offset, since it reflects the
power that results if the dc offset were to pass through a 1-Ω resistance (viz.,
P = I²R = 40²(1) = 1600). Finally, the autocovariance function of X(t) includes
only the first two parts of Eq. (7.6), dissipative and periodic components, and not
the dc power offset. In a sense, C_XX(τ) tells us something about the ac power in our
random current waveform. We will explore these ideas much further in Chap. 8.
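The decomposition in Eq. (7.6) can be made concrete for Example 7.18 with a short sketch. The function names below are hypothetical, and tri is implemented under the assumption that tri(x) = 1 − |x| for |x| < 1 and 0 otherwise.

```python
import math

def tri(x):
    """Triangular function (assumed form): 1 - |x| for |x| < 1, and 0 otherwise."""
    return max(0.0, 1.0 - abs(x))

def r_dissipative(tau):
    """Autocorrelation of the aperiodic component X_1(t)."""
    return 2000 * tri(10_000 * tau)

def r_periodic(tau):
    """Autocorrelation of the periodic component X_2(t)."""
    return 650 * math.cos(40_000 * math.pi * tau)

DC_POWER = 40 ** 2   # square of the dc offset

def r_xx(tau):
    """Eq. (7.6): dissipative + periodic + constant."""
    return r_dissipative(tau) + r_periodic(tau) + DC_POWER

def c_xx(tau):
    """Autocovariance: the dc power offset is removed."""
    return r_xx(tau) - DC_POWER

print(r_xx(0.0), c_xx(0.0))   # mean square value and variance of X(t)
print(r_dissipative(1e-3))    # the dissipative part has already vanished at this lag
```

Evaluating at τ = 0 gives the mean square value (the sum of all three parts) and the variance (the sum of the first two), matching the interpretation of C_XX(τ) as the ac part of the power.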


7.3.2 Ergodic Processes 


As we’ll see in this chapter and the next, it is desirable to understand the statistical 
properties of a random process—mean, variance, and so on. But because these are 
properties of the ensemble, a problem arises: in practice, we generally only observe 
a single realization of the process, and so we must somehow reconstruct the 
ensemble properties from this one signal. Thankfully, many stationary random 
processes have the feature that their time and ensemble properties match (e.g., the 
time average of a single realization equals the ensemble mean). A process with this 
feature is called ergodic. 

To give some intuition for this concept, imagine you have a large collection of 
identical dice. If you did not know how dice behave, you would have two options: 
roll all of the dice once, or roll one die many times. These should give us roughly 
the same information, but the first method illustrates ensemble properties (many 
realizations at one point in time) while the second gives time properties (one 
sampled die observed again and again over time). This process is ergodic. The 
benefit of ergodicity here is that we can learn about the behavior of all dice by 
purchasing (and repeatedly rolling) just a single die. 

To make the property of ergodicity more precise, we need to introduce the
time-dimension analogues of mean, autocorrelation, etc. For a fixed value T > 0, the
average of a function x(t) over the interval [−T, T] is defined in calculus by

\[
\langle x(t)\rangle_T = \frac{1}{2T}\int_{-T}^{T} x(t)\,dt
\]

By allowing T to approach infinity, we may define the time average of a
function x(t) by

\[
\langle x(t)\rangle = \lim_{T\to\infty}\langle x(t)\rangle_T = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} x(t)\,dt
\]

If the function x(t) is periodic, the time average defined above is equal to the
average of x(t) across one period.

The time average of a random process X(t) is defined by replacing x(t) in the
above expression with X(t). The result is a quantity that clearly does not depend on
time (since we have integrated dt) but which may vary across different members of
the ensemble (i.e., the time average ⟨X(t)⟩ is still a random quantity).


DEFINITION
A random process X(t) is mean ergodic if its time average and ensemble
average are the same, i.e., if

\[
\langle X(t)\rangle = E[X(t)]
\]

in the mean square sense.³


Since the time average of X(t) does not depend on t, a necessary condition for
mean ergodicity is for E[X(t)] to be a constant. Generally speaking, a random
process must be stationary in order to be ergodic, but this is not always sufficient.


Example 7.19 Let X(t) = A_0 cos(ω_0 t + Θ), where Θ ~ Unif(−π, π] and ω_0 ≠ 0.
We found in Example 7.12 that E[X(t)] = 0. Now consider the time average:

\[
\begin{aligned}
\langle X(t)\rangle_T &= \langle A_0\cos(\omega_0 t + \Theta)\rangle_T \\
&= \frac{1}{2T}\int_{-T}^{T} A_0\cos(\omega_0 t + \Theta)\,dt
= \frac{A_0}{2\omega_0 T}\left[\sin(\omega_0 T + \Theta) - \sin(-\omega_0 T + \Theta)\right]
\end{aligned}
\]

Since the term in brackets is bounded no matter the value of Θ, the factor of T in
the denominator implies that ⟨X(t)⟩_T → 0 as T → ∞, whence ⟨X(t)⟩ = 0. Therefore,
by definition, X(t) is mean ergodic. ■
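To see the time average of Example 7.19 shrink as T grows, one can compare a numerical integral against the closed-form expression. The sketch below uses a midpoint rule; the helper names and the values of A_0, ω_0, and the realized phase are assumptions chosen only for illustration.

```python
import math

def time_average(x, T, n=20000):
    """Midpoint-rule approximation of (1/(2T)) * integral of x(t) over [-T, T]."""
    h = 2 * T / n
    return sum(x(-T + (k + 0.5) * h) for k in range(n)) * h / (2 * T)

def closed_form(a0, w0, theta, T):
    """Closed-form time average over [-T, T] of a0*cos(w0*t + theta), for w0 != 0."""
    return a0 / (2 * w0 * T) * (math.sin(w0 * T + theta) - math.sin(-w0 * T + theta))

a0, w0, theta = 2.0, 3.0, 0.7   # one realization: theta is a fixed draw of the phase

def sample_function(t):
    return a0 * math.cos(w0 * t + theta)

for T in (1.0, 5.0, 25.0):
    print(T, round(time_average(sample_function, T), 6), round(closed_form(a0, w0, theta, T), 6))
```

Whatever phase is drawn, the closed form is bounded in absolute value by A_0/(|ω_0|T), so every realization has time average 0 in the limit, in agreement with the ensemble mean.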


Example 7.20 Consider a random dc signal, X(t) = X. That is, any particular
sample function of the random process is some constant x, but that constant varies
from realization to realization. This is trivially a stationary process; in particular,
E[X(t)] = E[X] = μ_X, a constant. However, the time average of X on [−T, T] is just
X, so ⟨X(t)⟩ = X, which is not a constant. Therefore, ⟨X(t)⟩ ≠ E[X(t)], and X(t) is not
mean ergodic. (To be precise, the time and ensemble averages would be equal in the
mean-square sense if we had E[(X − μ_X)²] = 0, i.e., if X had zero variance.)

This matches with intuition: if the level of the dc signal X varies across different
realizations of the process, then a single realization, X(t) = x, cannot tell us anything
about the statistical behavior of the signal. ■


The preceding example indicates the most common situation wherein a random 
process is stationary but not ergodic: some element of the process is random but not 
time-varying. See Example 7.21 below, as well as Exercise 36, for other such 
examples. 


³ A rv Y equals a constant c in the mean square sense if E[(Y − c)²] = 0. This is equivalent to
requiring that E(Y) = c and Var(Y) = 0. In the definition above, Y = ⟨X(t)⟩ is the rv and c = E[X(t)]
is the constant.

Similar to the definition of time average, we may define the time autocorrelation
of a random process by

\[
\langle X(t)X(t+\tau)\rangle = \lim_{T\to\infty}\langle X(t)X(t+\tau)\rangle_T
= \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^{T} X(t)X(t+\tau)\,dt
\]

X(t) is said to be autocorrelation ergodic if ⟨X(t)X(t + τ)⟩ = E[X(t)X(t + τ)] in
the mean square sense. Since the time autocorrelation does not depend on t, a
random process can only be autocorrelation ergodic if its autocorrelation function
R_XX(t, t + τ) is also free of t (the second condition for wide-sense stationarity).


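Example 7.21 (Example 7.19 continued) We showed in Example 7.14 that the
random process X(t) = A_0 cos(ω_0 t + Θ) has autocorrelation function R_XX(τ) =
(A_0²/2)cos(ω_0 τ). Applying an appropriate trig identity, we can also find its time
autocorrelation:

\[
\begin{aligned}
\langle X(t)X(t+\tau)\rangle_T &= \frac{1}{2T}\int_{-T}^{T} A_0\cos(\omega_0 t + \Theta)\cdot A_0\cos(\omega_0(t+\tau) + \Theta)\,dt \\
&= \frac{A_0^2}{2T}\int_{-T}^{T} \frac{1}{2}\left[\cos(\omega_0\tau) + \cos(\omega_0(2t+\tau) + 2\Theta)\right]dt \\
&= \frac{A_0^2}{4T}\cos(\omega_0\tau)\,[2T] + \frac{A_0^2}{4T}\int_{-T}^{T}\cos(\omega_0(2t+\tau) + 2\Theta)\,dt \\
&= \frac{A_0^2}{2}\cos(\omega_0\tau) + \frac{A_0^2}{8\omega_0 T}\left[\sin(\omega_0(2T+\tau) + 2\Theta) - \sin(\omega_0(-2T+\tau) + 2\Theta)\right]
\end{aligned}
\]

Taking the limit as T → ∞, the second term above goes to zero, and we have
⟨X(t)X(t + τ)⟩ = (A_0²/2)cos(ω_0 τ), the same as the autocorrelation function of X(t).
Hence, X(t) is autocorrelation ergodic.

However, suppose we replace the constant amplitude A_0 with a random amplitude
A, i.e., a random variable not varying with time. Then calculations similar to
those in Example 7.12 and above show that

\[
R_{XX}(\tau) = \frac{E[A^2]}{2}\cos(\omega_0\tau) \quad\text{while}\quad \langle X(t)X(t+\tau)\rangle = \frac{A^2}{2}\cos(\omega_0\tau)
\]

These are not equal (that is, now X(t) is not autocorrelation ergodic) unless it
happens that E[A²] = A² in the mean square sense, which is true iff Var(A²) = 0; for
a nonnegative amplitude A, this is the same as requiring Var(A) = 0. ■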
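The contrast drawn in Example 7.21 can be explored numerically by computing the time autocorrelation of individual sample functions. The sketch below is illustrative only: the helper names, the midpoint-rule integration, and the values of ω_0, τ, T, amplitude, and phase are all assumptions.

```python
import math

def time_autocorrelation(x, tau, T, n=50_000):
    """Midpoint-rule approximation of (1/(2T)) * integral of x(t)*x(t + tau) over [-T, T]."""
    h = 2 * T / n
    total = 0.0
    for k in range(n):
        t = -T + (k + 0.5) * h
        total += x(t) * x(t + tau)
    return total * h / (2 * T)

def realization(amplitude, w0, theta):
    """One sample function amplitude*cos(w0*t + theta) for a fixed amplitude and phase."""
    return lambda t: amplitude * math.cos(w0 * t + theta)

w0, tau, T = 3.0, 0.4, 200.0
# Two realizations share the amplitude 2 but differ in phase; the third has amplitude 1.
for amplitude, theta in [(2.0, 0.3), (2.0, -1.1), (1.0, 0.3)]:
    estimate = time_autocorrelation(realization(amplitude, w0, theta), tau, T)
    target = amplitude ** 2 / 2 * math.cos(w0 * tau)
    print(amplitude, theta, round(estimate, 3), round(target, 3))
```

With a fixed amplitude, every phase leads to essentially the same time autocorrelation. When the amplitude changes from one realization to the next, the time autocorrelation changes with it, so a single realization cannot reproduce the ensemble quantity (E[A²]/2)cos(ω_0 τ).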
7.3.3 Exercises: Section 7.3 (25-40)


25. Define a random process X(t) = V + A_0 cos(ω_0 t + Θ), where V and Θ are
independent random variables; Θ ~ Unif(−π, π]; and V has mean μ_V and variance
σ_V². (That is, X(t) models a signal with both phase and dc variation.)

(a) Find the mean function of X(t).

(b) Find the autocorrelation function of X(t).

(c) Is X(t) wide-sense stationary?

26. Define a random process X(t) = A cos(ω_0 t + Θ), where A and Θ are independent
random variables; Θ ~ Unif(−π, π]; and A has mean μ_A and variance σ_A². (That
is, X(t) models a signal with both phase and amplitude variation.)

(a) Find the mean function of X(t).

(b) Find the autocorrelation function of X(t).

(c) Is X(t) wide-sense stationary?

27. Determine whether each of the following functions could potentially be the
autocovariance function of a WSS random process.

(a) cos(τ)

(b) sin(τ)

(c) 1/(1 + τ²)

(d) et

28. Determine whether each of the following functions could potentially be the
autocovariance function of a WSS random process.

(a) etl

(b) 7

(c) tri(τ), defined by tri(τ) = 1 − |τ| for |τ| < 1 and 0 otherwise

(d) sinc(τ), defined by sinc(0) = 1 and sinc(τ) = sin(πτ)/(πτ) for τ ≠ 0

29. Let A and B be iid random variables, and define a random process by
X(t) = A cos(ω_0 t) + B sin(ω_0 t). Is X(t) necessarily WSS? Why or why not?

30. Define X(t) = At + B, where A and B are independent, A ~ Unif[−3, 3], and
B ~ Unif[−10, 10].

(a) Find the mean function of X(t).

(b) On the basis of (a), can you determine whether X(t) is WSS? If so, what is
your determination?

(c) Find the variance function of X(t).

(d) On the basis of (c), can you determine whether X(t) is WSS? If so, what is
your determination?

31. Let A(t) and B(t) be jointly wide-sense stationary random processes, and
define a new process by X(t) = A(t) + B(t). Find the mean and autocovariance
functions of X(t). Is X(t) WSS?

32. Let A(t) and B(t) be jointly wide-sense stationary random processes, and
define a pair of new processes by

...

Are X(t) and Y(t) jointly wide-sense stationary?

33. A wide-sense stationary process Y(t) has mean −7 and autocovariance function
C_YY(τ) = 50cos(100πτ) + 8cos(600πτ).

(a) Does Y(t) have any periodic components? How can you tell?

(b) Find Cov(Y(0), Y(0.01)).

(c) Find the autocorrelation function of Y(t).

(d) Find the mean square value of Y(t).

(e) Find the variance of Y(t).

34. A wide-sense stationary process X(t) has autocorrelation function R_XX(τ) = 60
4 125e7 1/100

(a) Does X(t) have any periodic components? How can you tell?

(b) Find the mean square value of X(t).

(c) Find the mean of X(t), if possible.

(d) Find the autocovariance function of X(t).

(e) Find Cov(X(10), X(15)).

(f) Find the standard deviation of X(t).

35. Consider the random process X(t) = A cos(ω_0 t) + B sin(ω_0 t), where A and B are
iid random variables with mean zero. In Example 7.15, we showed that X(t) is
wide-sense stationary. Is X(t) mean ergodic?

36. Let X(t) = A · Y(t), where A is a random variable and Y(t) is an ergodic, WSS
random process independent of A.

(a) Find the mean and autocorrelation of X(t) in terms of the properties of A and
Y(t). Is X(t) WSS?

(b) Show that the autocovariance function of X(t) is given by C_XX(τ) =
E(A²)C_YY(τ) + σ_A² μ_Y².

(c) Find the time average of X(t). Is X(t) mean ergodic?

(d) Assume Y(t) has no periodic component, so its autocovariance function
goes to 0 as |τ| → ∞. Does the same hold true for the autocovariance
function of X(t)? Why is this not a violation of Property 5 of WSS
processes?

37. Recall from Chap. 4 that the correlation coefficient of two rvs X and Y is given
by ρ(X, Y) = Cov(X, Y)/(σ_X σ_Y). For a WSS random process X(t), find the
correlation coefficient of X(t) and X(t + τ) in terms of the autocovariance function
C_XX(τ).

38. Let X(t) be a WSS random process.

(a) Prove that the autocovariance function C_XX(τ) satisfies |C_XX(τ)| ≤ C_XX(0).
[Hint: Use the previous exercise.]

(b) Prove that the autocorrelation function R_XX(τ) satisfies |R_XX(τ)| ≤ R_XX(0).

39. Let X(t) and Y(t) be jointly wide-sense stationary. Show that R_XY(τ) = R_YX(−τ)
and C_XY(τ) = C_YX(−τ).

40. Let X(t) be a WSS random process.

(a) Show that for any constants a_1, ..., a_n and any time points t_1, ..., t_n,

\[
\mathrm{Var}(a_1X(t_1) + \cdots + a_nX(t_n)) = \sum_{j=1}^{n}\sum_{k=1}^{n} a_j a_k C_{XX}(t_j - t_k)
\]

(b) Explain why a valid autocovariance function must be positive semi-definite,
i.e., C_XX(τ) must satisfy

\[
\sum_{j=1}^{n}\sum_{k=1}^{n} a_j a_k C_{XX}(t_j - t_k) \ge 0
\]

for all constants a_1, ..., a_n and times t_1, ..., t_n.

(c) Now consider the “rectangular function” defined by

\[
\mathrm{rect}(\tau) = \begin{cases} 1 & |\tau| < 1/2 \\ 0 & \text{otherwise} \end{cases}
\]

Show that rect(τ) satisfies the properties of an autocovariance function
expressed in the main proposition of this section.

(d) By considering n = 3, a_1 = a_3 = 1, a_2 = −1, t_1 = −.3, t_2 = 0, t_3 = .3, show
that rect(τ) is not positive semi-definite and, hence, cannot be the
autocovariance function of any WSS random process.

[Note: It can be shown that positive semi-definiteness completely characterizes
the collection of all valid autocovariance functions. That is, any valid
autocovariance function is automatically positive semi-definite, as shown in part
(b), and for any positive semi-definite function C there exists a WSS random
process whose autocovariance function is C(τ).]


7.4 Discrete-Time Random Processes 


Much of what we have discussed in the previous two sections applies equally to 
discrete-time random processes (aka random sequences). After reviewing 
definitions and properties for the discrete-time case, we introduce a few specific 
examples of important discrete-time models. 

Recall from Sect. 7.1 that a random sequence is simply a list of random variables
X_1, X_2, and so on; we write X_n for the general term. (We may also define a sequence
indexed by the entire set of integers: ..., X_−2, X_−1, X_0, X_1, X_2, ....) The subscript
takes the place of the time index t from earlier. The sequence can also be written as
X[1], X[2], ... or X[n] to mirror the continuous-time notation; the square brackets
will remind us that we’re working on a discrete-time scale.


Example 7.22 Let’s return to the value of Apple Inc. stock, but now consider only
recording the closing price at the end of each trading day (starting, say, January 2 of
next year). If we define X_n to be the closing price of Apple stock on the nth recorded
day, then we can model X_1, X_2, ... as a random sequence. Figure 7.13 shows one
possible sample function of this random sequence, assuming the closing price just
prior to our day 1 was $580. In effect, we have converted a continuous-time process
(analog) into a discrete-time process (digital) by sampling the process from
Example 7.2 at designated times. ■

Fig. 7.13 A sample function of the random sequence of Example 7.22


For any random sequence, we can define several statistical functions for times
n = 1, 2, 3, ... as follows:

• Mean function: μ_X[n] = E[X_n]

• Variance function: σ_X²[n] = Var(X_n)

• Standard deviation function: σ_X[n] = √Var(X_n)

• Autocovariance function: C_XX[n, m] = Cov(X_n, X_m)

• Autocorrelation function: R_XX[n, m] = E[X_n X_m]

The relationships between these functions established in the continuous-time
case still hold, e.g., σ_X²[n] = C_XX[n, n], and C_XX[n, m] = R_XX[n, m] − μ_X[n]μ_X[m].

A random sequence is (strict-sense) stationary if the joint distribution of
X[n_1], ..., X[n_r] equals the joint distribution of X[n_1 + k], ..., X[n_r + k] for any indices
n_1, ..., n_r and any integer k. A random sequence is wide-sense stationary if
(1) μ_X[n] is a constant, μ_X, and (2) C_XX[n, m] depends only on the difference
m − n; if we call this difference k, we may then denote the autocovariance function
as C_XX[k]. As was true for continuous-time processes, we may make an equivalent
definition regarding the mean and autocorrelation functions.


Example 7.23 Any Markov chain from Chap. 6 is an example of a random
sequence, provided the states for the chain are truly quantitative (e.g., counts or
dollar amounts and not indicators for locations). Figure 7.14 shows two sample
functions for the Gambler’s Ruin chain used repeatedly in Chap. 6, assuming
Allan’s initial stake is $3, Beth’s is $2, and p = .5; the connecting line segments
are just to help distinguish the two iterations. For each nonnegative index n, X_n
equals the amount of money held by Allan after n games have been played.

Fig. 7.14 Two sample functions of a Gambler’s Ruin random sequence

The mean function μ_X[n] is the mean value of Allan’s fortune after n games have
been played. For example, with an initial stake of $3 (X_0 = 3), X_1 equals $2 or $4
with probability .5 each, so μ_X[1] = $2(.5) + $4(.5) = $3. By considering the
outcomes of the first two games, we find X_2 to be $1, $3, or $5 with probabilities
.25, .5, and .25, respectively; this gives μ_X[2] = $3 as well. In fact, it turns out that
μ_X[n] = $3 for all n under the specified conditions, even though the probability
distribution of X_n changes with n. (In particular, the distribution of X_n converges to
p(0) = .4 and p(5) = .6, i.e., Allan’s long run chance of winning all $5 at stake is
60%.)

Similarly, we can compute the variance function of X_n; using the distributions
described in the previous paragraph, it’s straightforward to show that σ_X²[0] = 0,
σ_X²[1] = 1, σ_X²[2] = 2, and σ_X²[n] → 6 as n → ∞. That the variance function is not
constant is sufficient to show that the Gambler’s Ruin Markov chain is not a WSS
random sequence. ■
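The mean and variance functions quoted in Example 7.23 can be reproduced by propagating the distribution of X_n one game at a time. The sketch below assumes the usual Gambler’s Ruin rules (Allan wins or loses $1 per game with probabilities p and 1 − p, and play stops at $0 or at the total stake); the function names are hypothetical.

```python
def gamblers_ruin_distribution(n, start=3, total=5, p=0.5):
    """Distribution of Allan's fortune after n games; states 0 and total are absorbing."""
    dist = [0.0] * (total + 1)
    dist[start] = 1.0
    for _ in range(n):
        new = [0.0] * (total + 1)
        new[0] += dist[0]            # ruined: stays at 0
        new[total] += dist[total]    # holds everything: stays at total
        for state in range(1, total):
            new[state + 1] += p * dist[state]        # Allan wins a game
            new[state - 1] += (1 - p) * dist[state]  # Allan loses a game
        dist = new
    return dist

def mean_and_variance(dist):
    mean = sum(x * q for x, q in enumerate(dist))
    var = sum(x * x * q for x, q in enumerate(dist)) - mean ** 2
    return mean, var

for n in (0, 1, 2, 10, 200):
    mean, var = mean_and_variance(gamblers_ruin_distribution(n))
    print(n, round(mean, 6), round(var, 6))
```

The mean stays at 3 for every n while the variance grows from 0 toward 6, so the first WSS condition holds but the second fails.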


7.4.1 Special Discrete Sequences 


Perhaps the simplest type of random sequence is the Bernoulli random sequence:
let \(X_1, X_2, \ldots\) be iid, with each \(X_n\) following a Bernoulli(p) distribution. A sample
function of a Bernoulli random sequence with p = .6 appears in Fig. 7.15. By grace
of the variables being iid, a Bernoulli random sequence is trivially (strict-sense)
stationary; in particular, we have

\[ \mu_X[n] = E[X_n] = p, \qquad \sigma_X^2[n] = \operatorname{Var}(X_n) = p(1 - p), \]

and

\[ C_{XX}[n, m] = \operatorname{Cov}(X_n, X_m) = \begin{cases} p(1 - p) & n = m \\ 0 & n \ne m \end{cases} \]

Fig. 7.15 A sample function of a Bernoulli random sequence


A more general iid sequence will also be strict-sense stationary, but the formulas
for the mean and variance will, of course, depend on the underlying common
distribution of the \(X_n\).

From an iid sequence, we can construct a much more interesting random
sequence as follows: define \(S_1 = X_1\), \(S_2 = X_1 + X_2\), and so on, so that
\(S_n = S_{n-1} + X_n = \sum_{i=1}^{n} X_i\) for all \(n \ge 2\). The resulting sequence of partial sums
\(S_1, S_2, S_3, \ldots\) is called a random walk. For example, from Chap. 2 we know the sum
of iid Bernoulli rvs is binomial; hence, if \(X_n\) represents a Bernoulli random sequence,
then the corresponding random walk \(S_n\) has a Bin(n, p) distribution for each n.

We can use the properties of iid sums to derive some general properties of 
random walks. 


PROPOSITION
Let \(X_1, X_2, \ldots\) be an iid sequence with common mean \(\mu_X\) and common
variance \(\sigma_X^2\). Let \(S_n\) be the associated random walk, i.e., \(S_n = X_1 + \cdots + X_n\)
for every n. Then
1. \(\mu_S[n] = E[S_n] = n\mu_X\)
2. \(\sigma_S^2[n] = \operatorname{Var}(S_n) = n\sigma_X^2\)
3. \(C_{SS}[n, m] = \min(n, m)\,\sigma_X^2\)

Proof The proofs of properties 1 and 2 were given in advance of the Central Limit
Theorem discussion in Sect. 4.5. To prove property 3, assume \(m \ge n\) and proceed as
follows:

\[
\begin{aligned}
C_{SS}[n, m] &= \operatorname{Cov}(S_n, S_m) = \operatorname{Cov}(X_1 + \cdots + X_n,\; X_1 + \cdots + X_m) \\
&= \operatorname{Cov}(X_1 + \cdots + X_n,\; X_1 + \cdots + X_n + X_{n+1} + \cdots + X_m) \\
&= \operatorname{Cov}(X_1 + \cdots + X_n,\; X_1 + \cdots + X_n) + \operatorname{Cov}(X_1 + \cdots + X_n,\; X_{n+1} + \cdots + X_m) \\
&= \operatorname{Cov}(S_n, S_n) + \operatorname{Cov}(X_1 + \cdots + X_n,\; X_{n+1} + \cdots + X_m)
\end{aligned}
\]

In the third line, we have used the distributive property of covariance. The first
term, \(\operatorname{Cov}(S_n, S_n)\), is simply \(\operatorname{Var}(S_n)\) (the covariance of \(S_n\) with itself). In the second
term, the \(X_i\)s (i = 1 to n) in the first argument are independent of the \(X_j\)s (j = n + 1 to
m) in the second argument; therefore, that covariance equals zero. Thus, we have
\(C_{SS}[n, m] = \operatorname{Var}(S_n) = n\sigma_X^2\).

This holds for \(m \ge n\); if \(m < n\), the same argument yields \(C_{SS}[n, m] = m\sigma_X^2\).
Therefore, for general n and m we may write \(C_{SS}[n, m] = \min(n, m)\,\sigma_X^2\). ■
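Property 3 can also be checked by simulation. The sketch below is illustrative only: it assumes steps of +1 or -1 with equal probability (so that the common variance is 1), and the function name `estimate_walk_cov`, the replication count, and the seed are arbitrary choices made for this example.

```python
import random


def estimate_walk_cov(n, m, reps=20000, seed=1):
    """Monte Carlo estimate of Cov(S_n, S_m) for a walk with +/-1 steps (n, m >= 1)."""
    rng = random.Random(seed)
    last = max(n, m)
    sum_n = sum_m = sum_nm = 0.0
    for _ in range(reps):
        s = 0
        s_n = s_m = 0
        for step in range(1, last + 1):
            s += rng.choice((-1, 1))   # one iid step X_i
            if step == n:
                s_n = s
            if step == m:
                s_m = s
        sum_n += s_n
        sum_m += s_m
        sum_nm += s_n * s_m
    # Sample version of E[S_n S_m] - E[S_n] E[S_m]
    return sum_nm / reps - (sum_n / reps) * (sum_m / reps)


print(estimate_walk_cov(10, 25))  # should land near min(10, 25) * 1 = 10
print(estimate_walk_cov(25, 25))  # estimates Var(S_25), should land near 25
```

Because the estimate is random, it will only approximate min(n, m) times the step variance; increasing `reps` tightens the agreement.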


Example 7.24 Let \(X_1, X_2, \ldots\) be iid, with each \(X_i\) being +1 or -1 with equal
probability. The resulting random walk \(S_n\) is called the simple symmetric random
walk in one dimension. Since each “step” \(X_i\) has mean 0 and variance 1, it follows
that \(\mu_S[n] = n(0) = 0\) and \(\sigma_S^2[n] = n(1) = n\). Several sample functions of \(S_n\) are shown
in Fig. 7.16; the connecting line segments are just to help distinguish different
iterations. Notice in Fig. 7.16b that the variability in \(S_n\) increases with n, but not
linearly; this corresponds to the fact that \(\mathrm{SD}(S_n) = \sqrt{n}\). We also see, especially for
larger n, that values of the ensemble of \(S_n\) are more concentrated near 0 and sparser
at the edges. This is a consequence of the Central Limit Theorem: since the \(X_i\)s are
iid, their sum \(S_n\) becomes increasingly normal as n increases. (In fact, this is true for
any random walk.)


Fig. 7.16 Simple symmetric random walks: (a) the first 30 steps for three sample functions;
(b) the first 200 steps for 100 sample functions ■

7.4.2 Exercises: Section 7.4 (41-52)

41. Let \(T_n\) denote the high temperature (°F) in Sacramento, CA on the nth day of
the year. Consider the following model:

\[ T_n = 75 + 25 \sin\!\left(\frac{2\pi}{365}(n - 150)\right) + 4\varepsilon_n \]

where the \(\varepsilon_n\) are a sequence of iid N(0, 1) rvs.
(a) Determine the probability that the high temperature on February 28 exceeds
60 °F.
(b) Find \(E[T_n]\).
(c) Find \(C_{TT}[n, m]\).
(d) Is \(T_n\) a WSS random sequence? Should it be?

42. The output of a certain amplifier, sampled every second, has the form

\[ X[n] = A_0 \sin(\omega_0 n) + Z[n], \]

where the noise component Z[n] is a sequence of iid N(0, \(\sigma\)) rvs for some \(\sigma > 0\).
(a) Find the mean function of X[n].
(b) Find the autocovariance function of X[n].
(c) Is X[n] wide-sense stationary?

43. A gambler plays roulette, betting $5 on black every time (so, she has probability
18/38 of winning on any particular spin). The gambler receives $5 for each
win and gives up $5 for each loss.
(a) Define a random sequence \(S_n\) = the number of games this gambler has won
after n spins. Find the mean, variance, autocovariance, and autocorrelation
function of \(S_n\).
(b) Define a random sequence \(Y_n\) = the amount of money this gambler has won
after n spins. Find the mean, variance, autocovariance, and autocorrelation
function of \(Y_n\).
(c) What is the probability the gambler is “ahead” after 10 spins (i.e., \(Y_{10} > 0\))?

44. Gravel is being loaded onto rail cars by a dump truck for long-distance
transport. Let \(X_n\) equal the amount of gravel (in tons) emptied onto the rail
car by the nth dump truck run, and assume the \(X_n\) are iid Unif[15, 17].
(a) Define \(S_n = X_1 + \cdots + X_n\). Interpret \(S_n\) in this context.
(b) Find the mean, variance, autocorrelation, and autocovariance functions
of \(S_n\).
(c) Use the Central Limit Theorem to approximate both the distribution of \(S_6\)
and \(P(S_6 > 100)\), the chance the dump truck will be able to fill a 100-ton rail
car in 6 runs.

45. Let X(t) be a WSS random process. For some fixed \(T_s > 0\) define \(X[n] = X(nT_s)\),
so that X[n] is a “sampled” version of X(t). Show that the random sequence X[n]
is also WSS.
46. A subsample of a random sequence X[n] is obtained by observing every kth
element of the sequence, for some integer k > 1. The resulting random
sequence, Y[n], is given by Y[n] = X[kn].
(a) Find the mean and autocorrelation functions of Y[n] in terms of those of
X[n].
(b) If X[n] is WSS, is the subsample Y[n] also WSS?

47. Let \(X_n\) be a WSS random sequence, and define a simple moving average
sequence \(Y_n\) by

\[ Y_n = \frac{X_n + X_{n-1}}{2} \]

(a) Find the mean function of \(Y_n\).
(b) Find the autocovariance function of \(Y_n\).
(c) Is \(Y_n\) wide-sense stationary?
(d) Find the variance function of \(Y_n\).

48. Let \(X_n\) be a sequence of iid random variables, and consider a new random
sequence \(Y_n\) given by

\[ Y_n = \frac{1}{2}X_n + \frac{1}{4}X_{n-1} + \frac{1}{4}X_{n-2} \]

Is \(Y_n\) WSS?

49. Let the random sequence \(\ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots\) be iid, with mean \(\mu\) and
variance \(\sigma^2\). Define a first-order autoregressive sequence \(Y_n\) by

\[ Y_n = \alpha Y_{n-1} + X_n, \]

where \(|\alpha| < 1\).
(a) Show that, for N > 0, \(Y_n = \alpha^N Y_{n-N} + \sum_{i=0}^{N-1} \alpha^i X_{n-i}\).
(b) Let \(N \to \infty\) in (a) to conclude that \(Y_n = \sum_{i=0}^{\infty} \alpha^i X_{n-i}\).
(c) Find the mean function of \(Y_n\).
(d) Find the autocovariance function of \(Y_n\).
(e) Is \(Y_n\) wide-sense stationary?
(f) Find the correlation coefficient \(\rho(Y_n, Y_{n+k})\).

50. Correlated bit noise. Let \(X_n\) be a sequence of random bits (0s and 1s)
constructed as follows: \(X_0\) = 0 or 1 with probability .5 each. For \(n \ge 1\),
\(X_n = X_{n-1}\) with probability .9 and \(X_n = 1 - X_{n-1}\) with probability .1. (In the
language of Chap. 6, this is a Markov chain with a symmetric transition
matrix.)
(a) Find the pmf of \(X_1\), and argue that this is also the pmf of \(X_n\) for all \(n \ge 1\).
(b) Is \(X_n\) a WSS sequence?
(c) Find the mean and variance functions of \(X_n\).
(d) It can be shown, using techniques from Chap. 6, that for \(k \ge 0\)

\[ P(X_{n+k} = 1 \mid X_n = 1) = \frac{1 + .8^k}{2} \]

Use this to find \(R_{XX}[k]\) and \(C_{XX}[k]\).

51. Let X[n] be a random sequence whose time index n ranges across all integers
..., -2, -1, 0, 1, 2, .... Similar to the continuous-time case, the time average
of X[n] over {-N, ..., 0, ..., N} is defined by

\[ \langle X[n] \rangle_N = \frac{1}{2N + 1} \sum_{n=-N}^{N} X[n], \]

which is just the arithmetic mean of X[-N], ..., X[N]. The (overall) time
average of X[n] is then defined as a limit: \(\langle X[n] \rangle = \lim_{N \to \infty} \langle X[n] \rangle_N\). Assume
X[n] is a WSS random sequence with mean \(\mu_X\) and autocovariance function
\(C_{XX}[k]\).
(a) Show that, for all integers \(N \ge 0\), \(E[\langle X[n] \rangle_N] = \mu_X\).
(b) Show that

\[ \operatorname{Var}(\langle X[n] \rangle_N) = \frac{1}{2N + 1} \sum_{k=-2N}^{2N} C_{XX}[k] \left(1 - \frac{|k|}{2N + 1}\right). \]

[Hint: Use the relationship Var(Y) = Cov(Y, Y) and the distributive property of
covariance to create a double sum (with indices m and n, say). Then make the
change of variable k = m - n and rearrange the terms to create a single
sum.]

52. Refer back to the previous exercise. A WSS random sequence X[n] is called
mean ergodic if its time average \(\langle X[n] \rangle_N\) converges to \(\mu_X\) as \(N \to \infty\), in the
sense that

\[ \lim_{N \to \infty} E\left[ \left( \langle X[n] \rangle_N - \mu_X \right)^2 \right] = 0 \]

(a) Use the previous exercise to show that X[n] is mean ergodic if

\[ \frac{1}{2N + 1} \sum_{k=0}^{2N} C_{XX}[k] \to 0 \quad \text{as } N \to \infty. \]

(b) Let X[n] be WSS with autocovariance function \(C_{XX}[k] = \alpha \cdot \rho^{|k|}\) for some
\(\alpha > 0\) and \(|\rho| < 1\). Show that X[n] is mean ergodic.


7.5 Poisson Processes


In Sect. 2.5 we indicated that, under rather general conditions, the Poisson distribution
furnishes a probability model for the number of events of some sort (logins to a
server, arrivals of radioactive pulses, flaws on the surface of a wafer, etc.) that occur
in some fixed interval of time or region of space. We now present a more formal
development of conditions that lead to the Poisson distribution in such contexts, and
then explore properties of this event process.

DEFINITION
Consider the experiment of observing randomly occurring events of some
type continuously over time. Define X(0) = 0, and define X(t) for t > 0 to be
the number of events that occur in the time interval (0, t]. X(t) is called a
Poisson (counting) process if it satisfies the following two conditions:
1. The numbers of events occurring in nonoverlapping time intervals are
independent.
2. There exists a parameter \(\lambda > 0\), called the rate of the process, such that the
number of events occurring in any interval of length \(\tau\) has a Poisson
distribution with mean \(\lambda\tau\).


Later in this section, we present an alternative definition of a Poisson counting 
process which does not explicitly assume that the event count follows a Poisson 
distribution. 

Condition 1 states that a Poisson counting process X(t) has independent
increments: if \((s_1, t_1]\) and \((s_2, t_2]\) are two intervals of time with \(t_1 \le s_2\), so that the
intervals do not overlap, then the number of events that occur in the first interval is
independent of the number of events that occur in the second interval. That is, the
“increment” \(X(t_1) - X(s_1)\) is independent of \(X(t_2) - X(s_2)\).

Condition 2 states that for any \(t \ge 0\) and \(\tau > 0\), the number of events in the
interval \((t, t + \tau]\) has a Poisson distribution with mean \(\lambda\tau\). Since this count is
represented by \(X(t + \tau) - X(t)\), we may write \(X(t + \tau) - X(t) \sim \text{Poisson}(\lambda\tau)\). Since
the distribution of this “increment” depends only on \(\tau\) and not t, we say that a
Poisson counting process has stationary increments. By substituting t = 0 and
\(\tau = t\) into this expression, we have that \(X(t) \sim \text{Poisson}(\lambda t)\), so that at each time t the
process itself has a Poisson distribution. It follows that

\[ \mu_X(t) = \lambda t \quad \text{and} \quad \sigma_X^2(t) = \lambda t \quad \text{(since mean and variance are equal for Poisson)} \]

A graph of a sample function of X(t) appears in Fig. 7.17. It is clear both from
Fig. 7.17 and the formulas above that a Poisson counting process is not stationary.

Let us now derive the autocovariance function of X(t). We will rely on a clever
trick; namely, for \(t \le s\) we will split the interval (0, s] into the smaller intervals (0, t]
and (t, s]. A similar “trick” was used for a random walk in the previous section.
Begin as follows:

\[
\begin{aligned}
C_{XX}(t, s) &= \operatorname{Cov}(X(t), X(s)) = \operatorname{Cov}(X(t),\; X(t) + [X(s) - X(t)]) \\
&= \operatorname{Cov}(X(t), X(t)) + \operatorname{Cov}(X(t),\; X(s) - X(t))
\end{aligned}
\]

In the second argument of the covariance function, we have separated X(s), the
count of the number of events in (0, s], into two pieces: X(t), the number of events in
(0, t], and \(X(s) - X(t)\), which represents the number of events in the interval
(t, s]. Now, we simplify: the first term, Cov(X(t), X(t)), is simply Var(X(t)); the
second term, thanks to Condition 1, represents the covariance of two independent
counts (since the intervals (0, t] and (t, s] don’t overlap). Thus, the second term is
zero, and we have \(C_{XX}(t, s) = \operatorname{Var}(X(t)) + 0 = \lambda t\) for \(t \le s\).

If s < t, we can use the same argument to find the covariance equals \(\lambda s\); therefore,
the general expression for the autocovariance function of X(t) is

\[ C_{XX}(t, s) = \lambda \min(t, s) \]

Fig. 7.17 A sample function of a Poisson (counting) process


Example 7.25 Database queries to a certain data warehouse occur randomly 
throughout the day. On average, 0.8 queries arrive per second during regular 
business hours. Assume a Poisson process model is applicable here. 

First, consider the number of queries in the first five seconds, X(5). The rv X(5) is
Poisson with mean \(\lambda t = 0.8(5) = 4\). Thus, for example, the chance of exactly three
queries in the first five seconds is

\[ P(X(5) = 3) = \frac{e^{-4} 4^3}{3!} = .195 \]

Next, let’s find the probability of exactly 1 query in the first second and exactly
2 queries in the four seconds thereafter, which requires the independent increments
property: the number of queries in the first second and the number of queries in the
four seconds thereafter are independent. More formally, X(1) and \(X(5) - X(1)\) are
independent Poisson random variables with means 0.8(1) = 0.8 and 0.8(4) = 3.2,
respectively. Hence,

\[
\begin{aligned}
P(X(1) = 1 \cap X(5) - X(1) = 2) &= P(X(1) = 1) \cdot P(X(5) - X(1) = 2) \\
&= \frac{e^{-0.8}\, 0.8^1}{1!} \cdot \frac{e^{-3.2}\, 3.2^2}{2!} = .075
\end{aligned}
\]

Finally, consider the random variables X(10) and X(30). These two rvs are not
independent: the time intervals (0, 10] and (0, 30] overlap. In fact, it should be
obvious that X(30) depends upon X(10), since X(30) counts the number of queries in
the first 10 s, X(10), plus the number of additional queries in the 20 s thereafter.
Intuition suggests that the two random variables are positively correlated, and we
verify this now.

The mean of X(10) is 0.8(10) = 8; since X(10) is Poisson, its standard deviation
is then \(\sqrt{8}\). Similarly, E[X(30)] = 0.8(30) = 24 and \(\mathrm{SD}(X(30)) = \sqrt{24}\). We can find
the covariance of X(10) and X(30) through the autocovariance function above:

\[ \operatorname{Cov}(X(10), X(30)) = C_{XX}(10, 30) = \lambda \min(10, 30) = 0.8(10) = 8 \]

Finally, the correlation coefficient of X(10) and X(30) is

\[ \operatorname{Corr}(X(10), X(30)) = \frac{\operatorname{Cov}(X(10), X(30))}{\mathrm{SD}(X(10))\,\mathrm{SD}(X(30))} = \frac{8}{\sqrt{8}\sqrt{24}} = \frac{1}{\sqrt{3}} = .577 \]

We’re not surprised to find a moderate, positive relationship between these two
variables. As you might guess, the correlation coefficient will be largest when the
two time intervals (here, (0, 10] and (0, 30]) overlap the most; if the time intervals
of two increments only overlap to a very small degree, the resulting correlation
coefficient will likewise be small (but still positive). ■
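The three numerical results of Example 7.25 can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name `poisson_pmf` and the variable names are assumptions made here, not notation from the text.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


rate = 0.8  # queries per second

# Exactly three queries in the first five seconds
p_three = poisson_pmf(3, rate * 5)

# One query in (0, 1] and two in (1, 5], using independent increments
p_joint = poisson_pmf(1, rate * 1) * poisson_pmf(2, rate * 4)

# Correlation of X(10) and X(30) from the autocovariance rate * min(t, s)
cov = rate * min(10, 30)
corr = cov / (math.sqrt(rate * 10) * math.sqrt(rate * 30))

print(round(p_three, 3), round(p_joint, 3), round(corr, 3))  # 0.195 0.075 0.577
```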


7.5.1 Relation to Exponential and Gamma Distributions 


Because the events in a Poisson process occur “at random,” there is a second type of
random variable we may wish to model: the time between events. Consider
Fig. 7.18, which illustrates a Poisson process: the symbols along the time axis
indicate the occurrences of events. Along the time axis, we have indicated several
random variables: \(T_1\) = the time until the first event occurs, measured from t = 0;
\(T_2\) = the time between the first and second events; \(T_3\) = the time between the second
and third events; and so on. These random variables \(T_1, T_2, \ldots\) are called the
interarrival times of the process. Unlike the Poisson count of events, which is
discrete, each of these random time lengths is a continuous random variable.
Thanks to the following theorem, their probability distribution is known.


THEOREM
Suppose events occur in accordance with the conditions of a Poisson counting
process. Define \(T_1\) = the time until the first event occurs and, for \(n \ge 2\),
\(T_n\) = the time between the occurrence of the (n - 1)th and nth events. Also,
define \(Y_n\) = the time until the nth event occurs, starting at t = 0. Then
1. \(T_1, T_2, \ldots\) are independent exponential random variables with parameter \(\lambda\)
(mean \(1/\lambda\)) and
2. \(Y_n\) is a gamma random variable with parameters \(\alpha = n\) and \(\beta = 1/\lambda\) (aka the
Erlang distribution).

Fig. 7.18 Interarrival times in a Poisson process


Proof Since the time intervals spanned by \(T_1\), \(T_2\), etc. do not overlap, the \(T_n\) are
independent by condition 1 of a Poisson process (independent increments). To find
the distribution of \(T_1\), start with its cdf: for t > 0,

\[
\begin{aligned}
F_{T_1}(t) &= P(T_1 \le t) = 1 - P(T_1 > t) \\
&= 1 - P(\text{no events occur in the time interval } (0, t]) \\
&= 1 - P(X(t) = 0) \quad \text{where } X(t) \sim \text{Poisson}(\lambda t) \\
&= 1 - \frac{e^{-\lambda t}(\lambda t)^0}{0!} = 1 - e^{-\lambda t}
\end{aligned}
\]

Differentiating,

\[ f_{T_1}(t) = \frac{d}{dt} F_{T_1}(t) = \frac{d}{dt}\left[1 - e^{-\lambda t}\right] = \lambda e^{-\lambda t} \]

This is the exponential(\(\lambda\)) pdf, as claimed. The distribution of \(T_2\) is also
exponential(\(\lambda\)) because, thanks to Condition 1 of the definition, we may “restart the
clock” when the first event occurs and derive the pdf of \(T_2\) in the exact same manner
as above. Propagating this idea forward, we have that \(T_n \sim\) exponential(\(\lambda\)) for all n.

As for \(Y_n\), notice we may write \(Y_n = T_1 + \cdots + T_n\), which implies \(Y_n\) is the sum
of n iid exponential(\(\lambda\)) rvs. Exercise 65 in Sect. 4.3 showed, using moment-generating
functions, that the sum of n iid exponential(\(\lambda\)) rvs has a gamma(n, \(1/\lambda\))
distribution. ■


Exercise 72 offers a direct proof of Statement 2 of the preceding theorem, 
without relying on moment-generating functions or Statement 1. 


Example 7.26 Consider again the database queries described in Example 7.25.
Rather than investigate the number of queries in preset time intervals, let’s look at
the random arrival times themselves. The average time between successive queries
can actually be deduced without the preceding theorem: if queries arrive at 0.8
queries/second on average, then the mean time between queries is clearly just the
reciprocal: 1/0.8 = 1.25 s. If we let T = the time between queries, we now know that
T ~ exponential(0.8), whence \(E(T) = 1/\lambda = 1/0.8 = 1.25\) s and \(\mathrm{SD}(T) = 1/\lambda = 1.25\) s
as well (remember that an exponential random variable has identical mean and
standard deviation).

Next, let \(Y_{50}\) = time to the 50th query, starting at the beginning of regular
business hours. The preceding theorem tells us \(Y_{50} \sim\) gamma(50, 1/0.8), so
\(E(Y_{50}) = 50(1/0.8) = 62.5\). We expect the 50th query to arrive 62.5 s into regular
business hours. The arrival time of the 50th query has a standard deviation of

\[ \mathrm{SD}(Y_{50}) = \sqrt{\alpha\beta^2} = \sqrt{50(1/0.8)^2} = \sqrt{78.125} = 8.84 \text{ s}. \]

If 50 or more queries arrive within the first minute, system users will experience
a significant backlog in subsequent minutes because of processing time. What is the
probability this happens? A backlog occurs iff \(Y_{50} \le 1\) min = 60 s. The probability
of this event, evaluated using software, is

\[ P(Y_{50} \le 60) = \int_0^{60} \frac{1}{(50 - 1)!\,(1/0.8)^{50}}\, x^{50-1} e^{-0.8x}\, dx = \cdots = .4054 \]

Alternatively, return to the original Poisson process: a backlog occurs iff X(60),
the number of queries in the first 60 s, is 50 or more. Since X(60) has a Poisson
distribution with mean 60(0.8) = 48,

\[ P(X(60) \ge 50) = 1 - P(X(60) \le 49) = 1 - \sum_{x=0}^{49} \frac{e^{-48}\, 48^x}{x!} = 1 - .5946 = .4054 \]

■
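The agreement between the gamma integral and the Poisson sum in Example 7.26 can be verified numerically. The sketch below is illustrative: it evaluates the integral with Simpson's rule (a choice made here, not necessarily the software method used in the text) and the sum directly. The names `poisson_cdf` and `erlang_cdf` are hypothetical helpers.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean); each term is computed in log space."""
    return sum(math.exp(-mean + x * math.log(mean) - math.lgamma(x + 1))
               for x in range(k + 1))


def erlang_cdf(t, n, rate, steps=6000):
    """P(Y_n <= t) by Simpson's rule on the gamma(n, 1/rate) pdf (steps must be even)."""
    def pdf(x):
        if x <= 0:
            return rate if (n == 1 and x == 0) else 0.0
        return math.exp(n * math.log(rate) + (n - 1) * math.log(x)
                        - rate * x - math.lgamma(n))

    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3


via_gamma = erlang_cdf(60.0, 50, 0.8)         # P(Y_50 <= 60)
via_poisson = 1.0 - poisson_cdf(49, 48.0)     # P(X(60) >= 50)
print(round(via_gamma, 4), round(via_poisson, 4))
```

Both quantities describe the same event, so the two printed values should coincide up to numerical error.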


We have described a Poisson process as modeling the count of events that occur
“at random” across time. This notion can actually be made more precise: in a
Poisson process, given that an event has occurred by time \(t_0\), it is equally likely to
have occurred anywhere in the interval \((0, t_0]\). To see this, suppose we know that
exactly one event has occurred by time \(t_0\), so \(X(t_0) = 1\). Conditional on that
knowledge, let’s find the distribution of the random variable \(T_1\) = arrival time of
this event. Begin with the (conditional) cdf: for \(t < t_0\),

\[
\begin{aligned}
P(T_1 \le t \mid X(t_0) = 1) &= \frac{P(T_1 \le t \cap X(t_0) = 1)}{P(X(t_0) = 1)} \\
&= \frac{P(\text{one event occurred in } (0, t] \text{ and none in } (t, t_0])}{P(X(t_0) = 1)} \\
&= \frac{P(X(t) = 1 \cap X(t_0) - X(t) = 0)}{P(X(t_0) = 1)} \\
&= \frac{\dfrac{e^{-\lambda t}(\lambda t)^1}{1!} \cdot \dfrac{e^{-\lambda(t_0 - t)}(\lambda(t_0 - t))^0}{0!}}{\dfrac{e^{-\lambda t_0}(\lambda t_0)^1}{1!}} = \frac{t}{t_0}
\end{aligned}
\]

Differentiating with respect to t, we find the conditional pdf of \(T_1\), given
\(X(t_0) = 1\), is \(1/t_0\), the uniform pdf on \((0, t_0]\).

Generalizing this argument, conditional on \(X(t_0) = n\) (i.e., on exactly n events
occurring in \((0, t_0]\)) each of the n event occurrence times is uniformly distributed on
\((0, t_0]\). Moreover, the n times are independent of one another. In light of this
uniform distribution property, it is fair to say that for a Poisson process, events
really do occur “at random.”
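The conditional uniformity just derived can be illustrated by simulation. The sketch below is a rough check, not a proof: it generates the first two interarrival times as exponential(\(\lambda\)) variables (as the preceding theorem permits), keeps only the runs with exactly one event in \((0, t_0]\), and summarizes the retained arrival times. The function name, the seed, and the values \(\lambda = 0.8\), \(t_0 = 1.25\) are arbitrary choices for this example.

```python
import random


def first_arrivals_given_one_event(rate, t0, reps=20000, seed=7):
    """Arrival times T_1 from simulated runs in which exactly one event falls in (0, t0]."""
    rng = random.Random(seed)
    kept = []
    for _ in range(reps):
        first = rng.expovariate(rate)            # T_1
        second = first + rng.expovariate(rate)   # T_1 + T_2, time of the second event
        if first <= t0 < second:                 # exactly one event by time t0
            kept.append(first)
    return kept


t0 = 1.25
times = first_arrivals_given_one_event(rate=0.8, t0=t0)
mean_time = sum(times) / len(times)
frac_first_quarter = sum(1 for t in times if t <= t0 / 4) / len(times)
print(len(times), round(mean_time, 3), round(frac_first_quarter, 3))
```

If the conditional distribution is uniform on \((0, t_0]\), the sample mean should be close to \(t_0/2\) and about a quarter of the retained times should fall in the first quarter of the interval.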


7.5.2 Combining and Decomposing Poisson Processes


In Sect. 4.3, we showed that the Poisson distribution is additive, i.e., the sum of two 
independent Poisson rvs is again Poisson distributed. This result immediately 
generalizes to Poisson counting processes. 


PROPOSITION
Let \(X_1(t)\) and \(X_2(t)\) be independent Poisson processes with rate parameters \(\lambda_1\)
and \(\lambda_2\), respectively. Define a new random process by \(X(t) = X_1(t) + X_2(t)\).
Then X(t) is also a Poisson process, with rate parameter \(\lambda_1 + \lambda_2\). (This theorem
can be extended to the sum of k independent Poisson processes for k > 2 as
well.)


Example 7.27 Two roads feed into the northbound lanes on the Anderson Street
Bridge. During rush hour, the number of vehicles arriving from the first road can be
modeled by a Poisson process with a rate parameter of 10 per minute, while arrivals
from the second road form an independent Poisson process with rate 8 cars per
minute. If we let X(t) denote the total number of cars entering the northbound lanes,
then X(t) is also a Poisson process, with rate parameter 10 + 8 = 18 vehicles per
minute. Hence, the probability that a total of more than 100 vehicles will arrive via
the two feeder roads in the first 5 min of rush hour is given by

\[ P(X(5) > 100) = 1 - P(X(5) \le 100) = 1 - \sum_{x=0}^{100} \frac{e^{-18(5)}\,(18(5))^x}{x!} = 1 - .865 = .135 \]

This calculation is much simpler than considering all the possible ways the two
individual Poisson processes could total more than 100 (e.g., 55 vehicles on the first
road and 48 on the second road, and so on). ■
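To see the additivity at work in Example 7.27, the following illustrative Python sketch computes P(X(5) > 100) two ways: directly from a Poisson(90) distribution, and by the "hard way" of summing over all splits between the two roads (Poisson means 50 and 40). The helper names are assumptions made for this sketch.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean), computed in log space."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def tail_from_sum(mean, threshold):
    """P(X > threshold) for a single Poisson(mean) count."""
    return 1.0 - sum(poisson_pmf(x, mean) for x in range(threshold + 1))


def tail_from_convolution(mean1, mean2, threshold):
    """P(X1 + X2 > threshold) for independent Poisson counts, by enumerating pairs."""
    below = 0.0
    for i in range(threshold + 1):
        for j in range(threshold + 1 - i):
            below += poisson_pmf(i, mean1) * poisson_pmf(j, mean2)
    return 1.0 - below


combined = tail_from_sum(18 * 5, 100)                  # one process with rate 18
by_roads = tail_from_convolution(10 * 5, 8 * 5, 100)   # two roads handled separately
print(round(combined, 3), round(by_roads, 3))
```

The two results should agree to within floating-point error, while the first requires far less computation.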


The foregoing proposition and example show that we can combine separate 
Poisson processes into a single Poisson process. Interestingly, we can also do the 
reverse: if we can categorize the events of a Poisson process (e.g., arrivals of people 
separated into women’s arrivals and men’s arrivals), then we can decompose the 
overall process into two “smaller” processes. We make this more precise in the next 
proposition. 


PROPOSITION
Suppose events occur according to the conditions of a Poisson process, and
that each event can be classified as either Type 1 or Type 2. Suppose that each
event is Type 1 with probability p, independent of the types of all other events
and independent of the number of events that have occurred. Define
two random processes: \(X_1(t)\) = number of Type 1 events up to time t, and
\(X_2(t)\) = number of Type 2 events up to time t. Then
1. \(X_1(t)\) is a Poisson process with rate parameter \(p\lambda\);
2. \(X_2(t)\) is a Poisson process with rate parameter \((1 - p)\lambda\); and
3. \(X_1(t)\) and \(X_2(t)\) are independent.


Proof We will derive the joint distribution of \(X_1(t)\) and \(X_2(t)\), i.e., \(P(X_1(t) = x\) and
\(X_2(t) = y)\), for arbitrary nonnegative integers x and y. The event \(\{X_1(t) = x\) and
\(X_2(t) = y\}\) is equivalent to the event

\[ \{X(t) = x + y \text{ and exactly } x \text{ of these } x + y \text{ events are Type 1}\} = A \cap B \]

where X(t) denotes the overall Poisson process. The second part of this event
follows a binomial model: we have a fixed number of trials (x + y), each with two
basic outcomes (Type 1 or Type 2), plus independent trials and constant probability
by assumption. Combining that with the known Poisson distribution of X(t) and the
Multiplication Rule \(P(A \cap B) = P(A)P(B \mid A)\) gives

\[
\begin{aligned}
P(X(t) = x + y &\text{ and exactly } x \text{ of these } x + y \text{ events are Type 1}) \\
&= \frac{e^{-\lambda t}(\lambda t)^{x+y}}{(x + y)!} \cdot \binom{x + y}{x} p^x (1 - p)^y \\
&= \frac{e^{-\lambda t}(\lambda t)^{x+y}}{(x + y)!} \cdot \frac{(x + y)!}{x!\, y!}\, p^x (1 - p)^y \\
&= \frac{e^{-p\lambda t}\, e^{-(1 - p)\lambda t}\, (\lambda t)^x (\lambda t)^y\, p^x (1 - p)^y}{x!\, y!} \\
&= \frac{e^{-p\lambda t}(p\lambda t)^x}{x!} \cdot \frac{e^{-(1 - p)\lambda t}((1 - p)\lambda t)^y}{y!}
\end{aligned}
\]

We recognize these two functions as the pmfs of a Poisson(\(p\lambda t\)) distribution and a
Poisson(\((1 - p)\lambda t\)) distribution, respectively. Moreover, since the joint pmf of \(X_1(t)\) and
\(X_2(t)\) separates into the product of individual pmfs, \(X_1(t)\) and \(X_2(t)\) are independent.


Example 7.28 At a certain large hospital, patients enter the emergency room at a
mean rate of 15 per hour. Suppose 20% of patients arrive in critical condition, i.e.,
they require immediate treatment. Assume patient arrivals meet the conditions of a
Poisson process.

Let’s first find the probability that more than 50 patients arrive in the next 4 h.
Let X(t) denote the Poisson process of patient arrivals (regardless of condition).
Then X(4) has a Poisson distribution with mean \(\mu = \lambda t = 15(4) = 60\), so

\[ P(X(4) > 50) = 1 - P(X(4) \le 50) = 1 - \sum_{x=0}^{50} \frac{e^{-60}\, 60^x}{x!} = 1 - .108 = .892 \]

Next, we find the probability that more than 10 critical patients arrive in the next
4 h. Let \(X_1(t)\) denote the number of critical (“Type 1”) patients that arrive within
t hours. By the previous proposition, \(X_1(t)\) is a Poisson process with rate parameter
\(p\lambda = .20(15) = 3\), so \(X_1(4)\) is Poisson with mean 3(4) = 12. Thus,

\[ P(X_1(4) > 10) = 1 - P(X_1(4) \le 10) = 1 - \sum_{x=0}^{10} \frac{e^{-12}\, 12^x}{x!} = 1 - .347 = .653 \]

Finally, to find the probability that more than 10 critical patients and more than
40 noncritical patients arrive in the next 4 h, let \(X_2(t)\) denote the number of
noncritical (“Type 2”) patients that arrive within t hours. Then \(X_2(t)\) is also a
Poisson process, but with rate parameter \((1 - p)\lambda = (1 - .20)(15) = 12\). Thus
\(X_2(4) \sim\) Poisson(48); moreover, \(X_2(4)\) is independent of \(X_1(4)\). Therefore,

\[ P(X_1(4) > 10 \cap X_2(4) > 40) = P(X_1(4) > 10) \cdot P(X_2(4) > 40) = (.653)(.862) = .563 \]

The calculation of \(P(X_2(4) > 40)\) is similar to those displayed above. ■
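The probabilities in Example 7.28 can be reproduced with the short Python sketch below, which also spot-checks the factorization from the proof of the decomposition proposition at one arbitrarily chosen point (x, y) = (12, 48). The helper names `poisson_pmf` and `poisson_sf` are assumptions for this illustration.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean), computed in log space."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def poisson_sf(k, mean):
    """P(X > k) for X ~ Poisson(mean)."""
    return 1.0 - sum(poisson_pmf(x, mean) for x in range(k + 1))


rate, hours, p_critical = 15.0, 4.0, 0.20

p_all = poisson_sf(50, rate * hours)                         # more than 50 arrivals
p_crit = poisson_sf(10, p_critical * rate * hours)           # more than 10 critical
p_noncrit = poisson_sf(40, (1 - p_critical) * rate * hours)  # more than 40 noncritical
p_both = p_crit * p_noncrit                                  # independence of the two types
print(round(p_all, 3), round(p_crit, 3), round(p_noncrit, 3), round(p_both, 3))

# Spot check: Poisson total times binomial split equals product of two Poisson pmfs.
x, y = 12, 48
lhs = (poisson_pmf(x + y, rate * hours)
       * math.comb(x + y, x) * p_critical ** x * (1 - p_critical) ** y)
rhs = (poisson_pmf(x, p_critical * rate * hours)
       * poisson_pmf(y, (1 - p_critical) * rate * hours))
print(abs(lhs - rhs) < 1e-12)
```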


7.5.3 Alternative Definition of a Poisson Process 


The definition of a Poisson process at the beginning of this section almost seems a
tautology, since we said X(t) is a Poisson process if it has a Poisson distribution. The
following theorem provides an alternative way to define a Poisson counting process.

THEOREM
Consider the experiment of observing randomly occurring events of some
type along continuous time. Define X(0) = 0, and define X(t) for t > 0 to be the
number of events that occur in the time interval (0, t]. Suppose X(t) has the
following properties:
1. X(t) has independent and stationary increments.
2. There exists \(\lambda > 0\) such that in any time interval of length h, the probability
that exactly one event occurs is \(\lambda h + o(h)\).*
3. The probability of more than one event occurring in an interval of length
h is o(h).
Then X(t) is a Poisson counting process with rate parameter \(\lambda\).


Proof Because of the stationarity assumption, it suffices to consider a time interval
beginning at time 0. Let \(P_k(t)\) denote the probability that exactly k events occur in
the interval [0, t]. First consider \(P_0(t + h)\), the probability that no events occur during
the first t + h units of time. In order for this to happen, no events must occur in [0, t]
and also no events must occur during the next h units of time. Since these two time
intervals are nonoverlapping, the number of events that occur in the first interval is
independent of the number that occur in the second interval. Thus

\[
\begin{aligned}
P_0(t + h) &= P_0(t) \cdot P(\text{no events in an interval of length } h) \\
&= P_0(t) \cdot [1 - P(\text{exactly one event}) - P(\text{at least two events})] \\
&= P_0(t) \cdot [1 - (\lambda h + o(h)) - o(h)] \\
&= P_0(t) \cdot [1 - \lambda h - o(h) - o(h)] = P_0(t) \cdot [1 - \lambda h + o(h)]
\end{aligned}
\]

Rearranging this expression gives

\[ \frac{P_0(t + h) - P_0(t)}{h} = -\lambda P_0(t) + \frac{o(h)}{h} \]

Now taking the limit as \(h \to 0\) gives the derivative of \(P_0(t)\):

\[ P_0'(t) = -\lambda P_0(t) \]

This differential equation has the unique solution \(P_0(t) = ce^{-\lambda t}\), where the
constant c is determined by the initial condition \(P_0(0) = 1\). This implies that c = 1
and thus that \(P_0(t) = e^{-\lambda t}\), which sure enough is the probability of no events when
the distribution is Poisson with parameter \(\lambda t\).

*Readers not familiar with o(h) notation should consult Appendix B.

Now consider \(P_k(t)\) for general k. In order to have k events occur in the interval
[0, t + h], it must be the case that either (1) k events occur in [0, t] and none in the
next h time units, or (2) k - 1 events occur in [0, t] and one occurs in the next h time
units, or (3) for \(l \ge 2\), \(k - l\) occur in [0, t] and l occur in the next h time units.
By condition 3 in the theorem, the probability of the event in (3) is o(h). Writing
\(P_k(t + h)\) as a sum of probabilities corresponding to cases (1), (2), and (3),
rearranging, dividing by h, and taking the limit as \(h \to 0\) gives the following system
of differential equations:

\[ P_k'(t) = -\lambda P_k(t) + \lambda P_{k-1}(t) \qquad k = 1, 2, 3, \ldots \]

Letting \(Q_k(t) = P_k(t)e^{\lambda t}\), the above differential equation becomes \(Q_k'(t) = \lambda Q_{k-1}(t)\).
This system can be solved recursively starting with \(Q_0(t) = 1\) (because \(P_0(t) = e^{-\lambda t}\)) to
give \(Q_k(t) = \lambda^k t^k / k!\), whence

\[ P_k(t) = \frac{(\lambda t)^k e^{-\lambda t}}{k!} \qquad k = 0, 1, 2, \ldots \]

■
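The system of differential equations in the proof can also be integrated numerically as a sanity check on its closed-form solution. The sketch below uses a plain forward Euler scheme, a simple method chosen here for illustration and not the approach of the proof; the step size, the truncation level `k_max`, and the function name `integrate_pk` are assumptions of this example.

```python
import math


def integrate_pk(rate, t_end, k_max, dt=1e-4):
    """Forward-Euler solution of P_0' = -rate*P_0 and P_k' = -rate*P_k + rate*P_{k-1}."""
    steps = int(round(t_end / dt))
    probs = [1.0] + [0.0] * k_max       # initial condition: P_0(0) = 1, P_k(0) = 0
    for _ in range(steps):
        new = [probs[0] * (1.0 - rate * dt)]
        for k in range(1, k_max + 1):
            # Stay at k (no event in dt) or arrive from k - 1 (one event in dt)
            new.append(probs[k] * (1.0 - rate * dt) + probs[k - 1] * rate * dt)
        probs = new
    return probs


rate, t_end = 0.8, 5.0
numeric = integrate_pk(rate, t_end, k_max=12)
exact = [math.exp(-rate * t_end) * (rate * t_end) ** k / math.factorial(k)
         for k in range(13)]
max_gap = max(abs(a - b) for a, b in zip(numeric, exact))
print(max_gap)   # expected to be small, shrinking as dt decreases
```

Each Euler step mirrors conditions 2 and 3 of the theorem with the o(h) terms dropped: in a short interval of length dt, either no event occurs or exactly one does.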


In this chapter, we are discussing temporal stochastic processes, that is, random
processes that are functions of time. For some of these processes, in particular the
Poisson counting process, there exist spatial analogues. A spatial Poisson process
models the occurrence of “random” events in space, rather than in time (e.g., the
location of flaws on an integrated circuit, or of trees in a forest). Analogous to the
preceding theorem, suppose these random events meet the following conditions:
(1) the numbers of events in nonoverlapping regions of space are independent;
(2) the probability of exactly one event in a region of area h is \(\lambda h + o(h)\) for some
\(\lambda > 0\); and (3) the probability of more than one event in a region of area h is o(h).
Then a similar proof to the one above shows that the random variable X(R) =
number of events in region R has a Poisson distribution with mean \(\lambda \cdot\) (area of R).


7.5.4 Nonhomogeneous Poisson Processes 


The Poisson process considered thus far is characterized by a constant rate \(\lambda\) at
which events occur per unit time. A generalization of this is to suppose that the
probability of exactly one event occurring in the interval (t, t + h] is \(\lambda(t) \cdot h + o(h)\)
for some function \(\lambda(t)\). That is, we replace \(\lambda\) in condition 2 of the preceding theorem
with a nonnegative function \(\lambda(t)\). It can then be shown that the number of events
occurring during an interval \((t_1, t_2]\) has a Poisson distribution with mean

\[ E[X(t_2) - X(t_1)] = \int_{t_1}^{t_2} \lambda(t)\, dt \tag{7.7} \]

The occurrence of events over time in this situation is called a nonhomogeneous
Poisson process, and \(\lambda(t)\) is called the intensity function of the process. Notice that
the special case \(\lambda(t) = \lambda\) (a constant) returns us to the usual, “homogeneous” case; in
particular, from Eq. (7.7) we immediately have \(\mu_X(t) = \lambda t\).


Example 7.29 The article “Inference Based on Retrospective Ascertainment” 
(J. Amer. Statist. Assoc., 1989: 360-372) considers the intensity function 


A(t) _ ett 


as appropriate for events involving transmission of HIV (the AIDS virus) via blood 
transfusions. Suppose that a=2 and b=0.6 (close to values suggested in the 
paper), with time in years. What is the expected number of events in the first 
4 years? In the time interval (2, 6]? What is the probability that at most 20 events 
occur in first 18 months? 

To determine the expectation in any interval, we apply Eq. (7.7). The expected
number of events in the interval (0, 4] is

$$E[X(4) - X(0)] = \int_0^4 e^{2 + 0.6t}\,dt = 123.44,$$

while the expected number of events in (2, 6] equals $\int_2^6 e^{2 + 0.6t}\,dt = 409.82$. Notice
that the expected numbers of events for these two time intervals are quite different,
even though each interval has length 4 years; this illustrates that a nonhomogeneous
Poisson process does not have stationary increments.

Finally, the number of events in the first 18 months (1.5 years), X(1.5), has a
Poisson distribution with parameter

$$\mu = E[X(1.5)] = E[X(1.5) - X(0)] = \int_0^{1.5} e^{2 + 0.6t}\,dt = 17.975$$

Therefore,

$$P(X(1.5) \le 20) = \sum_{x=0}^{20} \frac{e^{-17.975}\,17.975^{x}}{x!} = .733$$
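
The three calculations in Example 7.29 can be reproduced with a short script. This is only a sketch: the function names `expected_events` and `poisson_cdf` are illustrative choices, not from the text, and the integral of the intensity is evaluated using its closed-form antiderivative rather than numerically.

```python
import math

def expected_events(a, b, t1, t2):
    """Integral of exp(a + b*t) over (t1, t2], i.e., Eq. (7.7) for this intensity."""
    return math.exp(a) / b * (math.exp(b * t2) - math.exp(b * t1))

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summed term by term."""
    term = math.exp(-mu)      # P(X = 0)
    total = term
    for x in range(1, k + 1):
        term *= mu / x        # P(X = x) obtained from P(X = x - 1)
        total += term
    return total

a, b = 2.0, 0.6               # values used in Example 7.29
mean_0_4 = expected_events(a, b, 0.0, 4.0)
mean_2_6 = expected_events(a, b, 2.0, 6.0)
mu_18_months = expected_events(a, b, 0.0, 1.5)
prob_at_most_20 = poisson_cdf(20, mu_18_months)

print(round(mean_0_4, 2), round(mean_2_6, 2))
print(round(mu_18_months, 3), round(prob_at_most_20, 3))
```

The two interval means differ even though both intervals have length 4, which is the lack of stationary increments noted in the example.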


7.5.5 The Poisson Telegraphic Process 


We end this section with a brief discussion of the Poisson telegraphic process
(or Poisson telegraph), a popular model in engineering for noise in a binary
channel. Suppose we have events occurring according to the conditions of a Poisson
process. Define a new random process, N(t), as follows: N(0) = −1 with probability
.5 and +1 with probability .5; when a random event occurs, N(t) switches parity (i.e.,
from −1 to +1 or vice versa). A sample function appears in Fig. 7.19; the x’s
through the middle indicate the time occurrences of the random events (notice these
are precisely where the process switches parity).

The statistical properties of the Poisson telegraph are catalogued in the following 
proposition. 
Fig. 7.19 One sample function of a Poisson telegraph

PROPOSITION 


Let N(t) be a Poisson telegraphic process with rate parameter λ.

1. For all t > 0, N(t) is +1 or −1 with probability .5 each. (Thus, a Poisson
telegraph has the same distribution at all time-points.)

2. μ_N(t) = 0 and σ_N(t) = 1 for all t > 0.

3. N(t) is WSS, and R_NN(τ) = C_NN(τ) = e^{−2λ|τ|}.


The proofs of these statements are left as exercises (see Exercises 69-70 at the 
end of this section). 
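
Property 3 can be checked numerically before attempting the proof. The sketch below is not the derivation requested in the exercises; it simply uses the fact that N(t)N(t + τ) equals +1 when an even number of events falls in the intervening interval and −1 otherwise. The function names and the values λ = 1.5, τ = 0.4 are arbitrary illustrative choices.

```python
import math

def prob_even_count(lam, tau, terms=200):
    """P(an even number of Poisson events occurs in an interval of length |tau|)."""
    mu = lam * abs(tau)
    term = math.exp(-mu)          # P(0 events)
    total = 0.0
    for k in range(terms):
        if k % 2 == 0:
            total += term         # accumulate P(k events) for even k
        term *= mu / (k + 1)      # move from P(k events) to P(k + 1 events)
    return total

def telegraph_autocorrelation(lam, tau):
    """E[N(t)N(t + tau)] = (+1) * P(even count) + (-1) * P(odd count)."""
    p_even = prob_even_count(lam, tau)
    return p_even - (1.0 - p_even)

lam, tau = 1.5, 0.4               # illustrative values only
r_numeric = telegraph_autocorrelation(lam, tau)
r_formula = math.exp(-2 * lam * abs(tau))
print(r_numeric, r_formula)
```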

We more commonly think of the symbols 0 and 1 in binary communication,
rather than −1 and +1. The Poisson telegraph described above can be easily
modified: let N*(t) = .5[N(t) + 1], so that N*(t) takes on the values 0 and 1. We
call N*(t) a Poisson 0-1 telegraph. In Exercise 71 of this chapter, you are asked to
derive the properties of the Poisson 0-1 telegraph.


7.5.6 Exercises: Section 7.5 (53-72) 


53. The number of requests for assistance received by a towing service is a Poisson
process with rate λ = 4 per hour.
(a) Compute the probability that exactly ten requests are received during a 
particular 2-h period. 
(b) If the operators of the towing service take a 30-min break for lunch, what is 
the probability that they do not miss any calls for assistance? 
(c) How many calls would you expect during their break? 
54. During the daily lunch rush, arrivals at the drive-thru at a nearby fast food
restaurant follow a Poisson process with a rate of 0.8 customers per minute.
(a) What is the expected number of customers in 1 h, and what is the
corresponding standard deviation?
(b) The drive-thru’s workers can’t handle more than 10 customers in any 5-min
span. Determine the probability that too many customers arrive for the
workers to handle between 12:15 p.m. and 12:20 p.m.
(c) A customer has just arrived. What is the probability another customer will
arrive within the next 30 s?
(d) The 100th lunch customer, starting at 12:00 p.m., gets a free meal. What is
the expected arrival time of that lucky customer, and what is the standard
deviation of that time?

55. Packets arrive at a certain node on the university’s intranet at 10 packets per
minute, on average. Assume packet arrivals meet the assumptions of a Poisson
process.
(a) Calculate the probability that exactly 15 packets arrive in the next 2 min.
(b) Find an expression for the probability that more than 75 packets arrive in
the next 5 min.
(c) Calculate the probability that the next packet will arrive in less than 15 s.
(d) What is the average time between successive packet arrivals?
(e) Calculate the probability that the fifth packet arrives in less than 45 s.

56. The article “Reliability-Based Service-Life Assessment of Aging Concrete
Structures” (J. Struct. Engrg., 1993: 1600-1621) suggests that a Poisson
process can be used to represent the occurrence of structural loads over time.
Suppose the mean time between occurrences of loads is .5 year.
(a) How many loads can be expected to occur during a 2-year period?
(b) What is the probability that more than five loads occur during a 2-year
period?
(c) How long must a time period be so that the probability of no loads
occurring during that period is at most .1?

57. Travelers arrive at an airport shuttle station according to a Poisson process with
rate λ. The shuttle vehicle will depart only once k travelers have arrived.
Assuming that there are no travelers waiting at time 0, what is the expected
duration of time until the next shuttle vehicle departs?

58. The parking lot for a local ballpark has two entrances (east and west). In the
hour before a game, cars entering the lot from east and west form two
independent Poisson processes with rates 10 per minute and 15 per minute,
respectively.
(a) What is the expected number of cars entering the parking lot in any 10-min
span, and what is the corresponding standard deviation?
(b) In any particular minute, what is the probability exactly 12 cars enter from
each side?
(c) What is the probability that exactly 24 cars enter the lot in any particular
minute?
(d) Write an expression for the probability that, in any particular minute, the
same number of cars enter through the east side and the west side.
59. Orders are submitted to a certain online business according to a Poisson process
with rate 3 orders per hour.
(a) Given that 4 orders are submitted during the time interval [0, 2], what is the
probability that 10 orders are submitted in the interval [0, 5]?
(b) More generally, consider two fixed times s < t and two nonnegative
integers m < n. Given that m orders are submitted by time s, what is the
probability that n orders are submitted by time t?

60. Automobiles arrive at a vehicle equipment inspection station according to a
Poisson process with rate λ = 10 per hour. Suppose that with probability .5 an
arriving vehicle will have no equipment violations.
(a) What is the probability that exactly ten vehicles arrive during the hour and
all ten have no violations?
(b) For any fixed y ≥ 10, what is the probability that exactly y vehicles arrive
during the hour, of which ten have no violations?
(c) What is the probability that ten “no-violation” cars arrive during the next
hour? [Hint: Sum the probabilities in (b) from y = 10 to ∞.]

61. A certain component is subject to electrical surges over time. Suppose that
these surges occur according to a Poisson process with rate λ. Suppose also that
with probability p, any particular surge will disable the component. What is the
probability that the component survives (is not disabled) throughout the period
[0, t]? [Hint: Make appropriate independence assumptions.]

62. Suppose events occur according to a Poisson process with rate λ.
(a) Given that n events have occurred in the interval [0, 7], what is the probability
that x of these events occurred in [0, 1]? [Hint: Let X(t) be the Poisson
process, and write the conditional probability of interest in terms of X(t).
Then apply the definition of conditional probability.]
(b) Given that n events have occurred in the interval [0, n], what is the limiting
conditional distribution of the number of events in [0, 1] as n → ∞?

63. There is one hospital at the northern end of a particular county and another
hospital at the southern end of the county. Suppose that arrivals to each
hospital’s emergency room occur according to a Poisson process with the
same rate λ and that the two arrival processes are independent of one another.
Starting at time t = 0, let Y be the elapsed time until at least one arrival has
occurred at each of the two emergency rooms. Determine the probability
distribution of Y.

64. Suppose that flaws occur along a cable according to a Poisson process with
parameter λ. A segment of this cable of length Y is removed, where Y has an
exponential distribution with parameter θ. Determine the distribution of the
number of flaws that occur in this random-length segment. [Hint: Let X be the
number of flaws on this segment. Condition on Y = y to obtain P(X = x | Y = y).
Then “uncondition” using the Law of Total Probability (multiply by the pdf of
Y and integrate). The gamma integral (3.5) will prove useful.]

65. Starting at time t = 0, certain events occur at random with inter-arrival times T_1,
T_2, and so on as in Fig. 7.18. Define X(t) = the number of arrivals in (0, t]; if we
assume the T_n are iid (but not necessarily exponentially distributed), then X(t) is
called a renewal process.
(a) Show that a renewal process whose inter-arrival times are iid exponential
rvs is a Poisson process. That is, show that if the T_n are iid exponential(λ)
rvs then X(t) has a Poisson(λt) distribution.
(b) The elementary renewal theorem states that, for any renewal process,

$$\lim_{t \to \infty} \frac{E[X(t)]}{t} = \frac{1}{E[T_n]}$$

Show that this is trivially true for a Poisson process.
[Note: A stronger version of the renewal theorem actually shows X(t)/t
converges in probability to the constant 1/E[T_n].]

66. Let X(t) count the number of events of a certain type in the time interval (0, t],
and suppose X(t) can be modeled by a nonhomogeneous Poisson process with
intensity function λ(t).
(a) Does X(t) have stationary increments? Why or why not?
(b) Find the mean and variance functions of X(t).
(c) What is the probability that no events occur in the time interval (0, t]?

67. A certain repair facility is open for 8 h on a particular day. Customers arrive
according to a nonhomogeneous Poisson process with rate (per hour) function
λ(t) = t for 0 ≤ t ≤ 1, = 1 for 1 < t ≤ 7, and = 8 − t for 7 < t ≤ 8.
(a) What is the probability that no customers arrive in both the first and last
hours and that 4 customers arrive in the middle 6 h?
(b) What is the probability that the same number of customers arrive in the first
hour, middle 6 h, and last hour?

68. During the first round of enrollment, students begin registering for classes at the
beginning of each hour. There’s a mad rush at the beginning of the hour, and
then logins taper off. Let X(t) = the number of logins t minutes into the hour,
and suppose X(t) can be modeled by a nonhomogeneous Poisson process with
intensity function λ(t) = 500/(t + 1)² for 0 ≤ t ≤ 60.
(a) What is the expected number of students that will log into the registration
system in the first 5 min of the hour? In the last 5 min of the hour?
(b) What is the probability that no students log in during the last 5 min of an
hour?
(c) The registration system will crash if more than 450 students log in during
any 5-min period. What is the probability that this occurs in the first 5 min
of an hour? (You will need to use software or a Central Limit Theorem
approximation to determine this probability.)

69. Consider a Poisson telegraphic process N(t) with rate parameter λ.
(a) Let p = P(an even number of events occur in (0, t]). Explain why, for t > 0,

P(N(t) = +1 | N(0) = +1) = p and P(N(t) = +1 | N(0) = −1) = 1 − p.

(b) Use (a) and the Law of Total Probability to show that P(N(t) = +1) = .5 for
all t > 0. (Since the only other possible value of N(t) is −1, this establishes
property 1 of the last proposition of this section.)
(c) Establish property 2 of the Poisson telegraph, i.e., that μ_N(t) = 0 and σ_N(t) = 1
for all t > 0.

70. (a) Consider a Poisson process with parameter λ. Show that the probability that
an even number of events (0, 2, 4, ...) occurs in any interval (t, t + τ] is
equal to (1 + e^{−2λτ})/2.

(b) Let N(t) be a Poisson telegraphic process with parameter λ. By considering
the possible values of the product N(t)N(t + τ), show that the autocorrelation
function of N(t) is e^{−2λ|τ|}. [Hint: Use (a).]

71. A Poisson 0-1 telegraph N*(t) is constructed as follows: N*(0) equals 0 or
1 with probability .5 each, and then N*(t) switches parity upon the occurrence
of an event in a Poisson process. Find the pmf, mean, variance, autocovariance
function, and autocorrelation function of N*(t). [Hint: Use the relationship
N*(t) = .5[N(t) + 1], where N(t) is an ordinary Poisson telegraph.]

72. This exercise outlines a proof that the time to the nth event of a Poisson process
has an Erlang distribution.

(a) Let Y_n denote the time to the nth event in a Poisson process. Argue that, for
any time y > 0, P(Y_n > y) = P(fewer than n events occur in the time interval
(0, y]).

(b) Suppose the Poisson process has rate parameter λ. Use (a) to write an
expression for the cdf of Y_n. [Hint: the right-hand side of (a) can be written
as a finite sum using the definition of a Poisson process.]

(c) Differentiate your answer to part (b) to obtain the pdf of Y_n, and verify that
it is an Erlang pdf with parameters n and λ (aka the gamma distribution with
α = n and β = 1/λ).


7.6 Gaussian Processes 


We introduced the normal or Gaussian distribution in Chap. 3 and then extended it 
to a multivariate distribution in Chap. 4. Here, we consider the extension of the 
normal distribution to random processes. Engineers commonly use such models for 
noise in audio signals and the (seemingly) random motion of small particles. We’ll
explore both of these applications shortly.


DEFINITION 
A random process X(t) is a Gaussian process if for all time points t_1, ..., t_n,
the random variables X(t_1), ..., X(t_n) have a multivariate normal distribution
(as defined in Sect. 4.7). In particular, the distribution of X(t) at any time point
t is normal.

As discussed in Sect. 4.7, we can also characterize a joint Gaussian distribution
by requiring that all linear combinations of the random variables be Gaussian.
Applying that characterization here, we have an alternate definition of a Gaussian
process: X(t) is a Gaussian process iff all linear combinations of X(t_1), ..., X(t_n)
have a normal distribution for n = 1, 2, 3, ..., and all time-points t_1, ..., t_n.

In Sect. 7.3, we distinguished strict-sense stationary processes from wide-sense
stationary processes. We noted that a strict-sense stationary process is automatically
WSS, but not vice versa. However, suppose that a Gaussian process X(t) is
WSS: this implies the mean and covariance structure of X(t) are time-invariant. But
we know from Sect. 4.7 that mean and covariance completely characterize a joint
Gaussian distribution; all other statistical properties can be derived from these two.
Thus, if a Gaussian process is WSS, all of its statistical properties must be
time-invariant.


PROPOSITION 
Suppose X(t) is a Gaussian process. Then X(t) is wide-sense stationary if, and
only if, X(t) is strict-sense stationary.


Example 7.30 The noise X(t) (measured in decibels) in an audio signal is modeled
as a wide-sense stationary Gaussian process, with mean zero and autocovariance
function

$$C_{XX}(\tau) = 0.04e^{-|\tau|/10}$$

Let’s first investigate the distributions of X(3) and X(8), the noise three and
eight seconds into the audio signal, respectively. Since X(t) is a Gaussian process,
by definition X(3) is a Gaussian random variable; we merely have to specify its
mean and variance. We are given that X(t) is a mean-zero process, so in particular
E[X(3)] = 0. We can extract the variance of X(3) from the autocovariance using a
property of WSS processes:

$$\mathrm{Var}(X(3)) = \sigma_X^2 = C_{XX}(0) = 0.04e^{-|0|/10} = 0.04$$

Therefore, X(3) ~ N(0, 0.2). Moreover, since X(t) is stationary, this is also the
distribution of X(8).

Next, notice that X(8) − X(3) is a linear combination of X(3) and X(8); therefore,
since X(t) is a Gaussian process, the random variable X(8) − X(3) is also Gaussian.
Its mean is simply E[X(8) − X(3)] = 0 − 0 = 0. Computing the variance takes a bit
more effort:

$$\begin{aligned}
\mathrm{Var}(X(8) - X(3)) &= \mathrm{Var}(X(8)) + (-1)^2\,\mathrm{Var}(X(3)) + 2(1)(-1)\,\mathrm{Cov}(X(8), X(3)) \\
&= \mathrm{Var}(X(8)) + \mathrm{Var}(X(3)) - 2\,\mathrm{Cov}(X(8), X(3)) \\
&= C_{XX}(0) + C_{XX}(0) - 2C_{XX}(-5) \qquad \text{since } \tau = 3 - 8 = -5 \\
&= 0.04 + 0.04 - 2 \cdot 0.04e^{-|-5|/10} \\
&= 0.08\left(1 - e^{-1/2}\right) = .0315
\end{aligned}$$

Therefore, X(8) − X(3) ~ N(0, √.0315). Finally, we can use this distribution to
find the probability the noise at t = 8 is more than 0.3 dB above the noise at t = 3:

$$\begin{aligned}
P(X(8) > 0.3 + X(3)) &= P(X(8) - X(3) > 0.3) = 1 - P(X(8) - X(3) \le 0.3) \\
&= 1 - \Phi\left(\frac{0.3 - 0}{\sqrt{.0315}}\right) = 1 - \Phi(1.69) = .0455
\end{aligned}$$
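
The variance and probability in Example 7.30 can be verified with a few lines of code. This is a sketch only; `autocov` and `std_normal_cdf` are illustrative helper names, and the standard normal cdf is written in terms of the error function from Python's standard library. Because the script does not round z to 1.69, its probability agrees with the tabled value only to about three decimal places.

```python
import math

def std_normal_cdf(z):
    """Standard normal cdf, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def autocov(tau):
    """C_XX(tau) = 0.04 * exp(-|tau| / 10), the autocovariance of Example 7.30."""
    return 0.04 * math.exp(-abs(tau) / 10.0)

# Var(X(8) - X(3)) = C_XX(0) + C_XX(0) - 2 * C_XX(3 - 8)
var_diff = autocov(0) + autocov(0) - 2 * autocov(3 - 8)
sd_diff = math.sqrt(var_diff)

# P(X(8) - X(3) > 0.3) for a mean-zero normal rv with standard deviation sd_diff
prob = 1.0 - std_normal_cdf((0.3 - 0.0) / sd_diff)
print(round(var_diff, 4), round(prob, 4))
```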


7.6.1 Brownian Motion


In Example 7.4, we introduced the idea of Brownian motion, a model for the 
seemingly random behavior of a dust particle on a liquid surface. Physicists also 
use the Brownian motion model to describe a variety of physical processes, 
including the motion of some celestial bodies in response to gravitational forces. 
A precise mathematical construction of Brownian motion is beyond the scope of 
this book; however, we characterize Brownian motion in the following definition. 


DEFINITION 

A (one-dimensional) Brownian motion process (also called a Wiener
process) with parameter a > 0, denoted B(t), is a Gaussian process with the
following properties:

1. μ_B(t) = 0; that is, Brownian motion is a mean-zero process.

2. σ_B²(t) = at, so σ_B(t) = √(at).

3. B(t) has stationary and independent increments.

If a = 1, B(t) is called standard Brownian motion.⁵


It can be shown that B(t + τ) − B(t) ~ N(0, √(aτ)) for any τ > 0, reflecting the
stationary increments property. Figure 7.20 shows several sample functions of
Brownian motion. It is important to note that while B(t) has stationary increments, it
is not itself a stationary process (the variance of Brownian motion depends on t).

Fig. 7.20 Brownian motion

⁵ Albert Einstein showed in 1905 from physical considerations that the conditional pdf of B(t_0 + t)
given B(t_0) = x_0 must satisfy the partial differential equation ∂f/∂t = ½a · ∂²f/∂x², where the
“diffusion constant” a involves a gas constant, temperature, a coefficient of friction, and
Avogadro’s number. He also showed that the unique solution to this PDE is the normal pdf.

Because Brownian motion is not a stationary process, we expect the
autocovariance function will depend on “absolute” time (t and s) rather than
“relative” time τ. It can be shown that

$$C_{BB}(t, s) = R_{BB}(t, s) = a \cdot \min(t, s)$$

The derivation is similar to that of the Poisson process autocovariance function in
the previous section.

Brownian motion actually shares several features with the Poisson process of the
previous section: both have initial value 0 (with probability 1), stationary and
independent increments, and variance proportional to time. In fact, it can be
shown (see Exercise 85 at the end of this section) that any random process with
constant initial value along with stationary and independent increments must
necessarily have a variance function that’s linear in t.

Since Brownian motion is a one-dimensional random process, clearly it can only 
describe particle motion in a single direction. The random motion of particles on a 
surface or through space can be described by 2- or 3-dimensional Brownian motion 
processes, for which it’s assumed the motion along each dimensional axis is an 
independent, one-dimensional Brownian motion process. 


Example 7.31 Consider the movement of a particle along a single axis, governed
by Brownian motion with parameter a = 4. Let’s begin by identifying the probability
distribution of the particle’s displacement from time t = 1 s to time t = 4 s. If we
write B(t) for the process, then we wish to know the distribution of B(4) − B(1).
Applying the comment below the definition of Brownian motion with τ = 4 − 1 = 3,
we have B(4) − B(1) ~ N(0, √12). (An alternative derivation uses a similar
approach to Example 7.30 and the autocovariance function mentioned above.)

The particle’s displacement from time t = 2 s to time t = 5 s has this same
distribution, since both increments span a time length of τ = 3 and, by Property
3 of the definition, B(t) has stationary increments. However, the increments B(5) −
B(2) and B(4) − B(1) are not independent: while Property 3 states that Brownian
motion has independent increments, the two time intervals (2, 5] and (1, 4] overlap.

Finally, what is the probability that the particle is displaced more than 10 units in
the time interval (1, 4]? Since the question does not indicate whether the displacement
is positive or negative (relative to the axis), we’re really interested in
determining P(|B(4) − B(1)| > 10). Because the distribution of B(4) − B(1) is
symmetric about 0, we may proceed as follows:

$$\begin{aligned}
P(|B(4) - B(1)| > 10) &= 2P(B(4) - B(1) > 10) \qquad \text{by symmetry} \\
&= 2[1 - P(B(4) - B(1) \le 10)] \\
&= 2\left[1 - \Phi\left(\frac{10 - 0}{\sqrt{12}}\right)\right] = 2[1 - \Phi(2.89)] = .0038
\end{aligned}$$
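
The sketch below reworks Example 7.31 in code, computing the increment variance two ways: directly as a·τ, and through the autocovariance function a·min(t, s), which is the alternative derivation mentioned in the example. The helper names are illustrative, not from the text. The probability is computed without rounding z to 2.89, so it matches the tabled answer only approximately.

```python
import math

def std_normal_cdf(z):
    """Standard normal cdf, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bm_autocov(a, t, s):
    """Autocovariance of Brownian motion with parameter a: a * min(t, s)."""
    return a * min(t, s)

a = 4.0

# Route 1: stationary increments, Var(B(t + tau) - B(t)) = a * tau with tau = 3
var_increment = a * (4 - 1)

# Route 2: Var(B(4) - B(1)) = C(4, 4) + C(1, 1) - 2 * C(4, 1)
var_from_autocov = bm_autocov(a, 4, 4) + bm_autocov(a, 1, 1) - 2 * bm_autocov(a, 4, 1)

# P(|B(4) - B(1)| > 10) = 2 * [1 - Phi(10 / sqrt(12))]
prob = 2.0 * (1.0 - std_normal_cdf(10.0 / math.sqrt(var_increment)))
print(var_increment, var_from_autocov, round(prob, 4))
```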


7.6.2 Brownian Motion as a Limit


The Brownian motion process described above can actually be constructed as the
limit of a discrete-time random process, specifically the simple symmetric random
walk S_n of Example 7.24. We will shrink both the time increment and the size
of a jump in this random walk as follows: for some h > 0 and Δx > 0, suppose at
times h, 2h, 3h, etc. the walk moves +Δx or −Δx with probability .5 each. Then,
with [ ] denoting the greatest integer function, the random process B(t) defined by

$$B(t) = (\Delta x)X_1 + \cdots + (\Delta x)X_{[t/h]} = (\Delta x)S_{[t/h]}$$

indicates the location of the random walk at time t. The coefficient Δx changes the
motion increment from ±1 to ±Δx; the time index n = [t/h] equals the number of
moves (equivalently, the number of h-second time intervals) in the interval [0, t].
From the properties of the random walk,

$$\mu_B(t) = E\left[\Delta x\,S_{[t/h]}\right] = \Delta x(0) = 0 \quad \text{and} \quad \sigma_B^2(t) = (\Delta x)^2\,\mathrm{Var}\left(S_{[t/h]}\right) = (\Delta x)^2 \cdot [t/h]$$

Moreover, the Central Limit Theorem tells us that, for large values of [t/h], the
distribution of B(t) is approximately normal.

Up to now, the choices of h and Δx have been arbitrary. But suppose we choose
Δx = √(ah) for some a > 0. Then, if we shrink h to 0 (effectively moving from
discrete time to continuous time), B(t) will be normally distributed with mean 0 and
variance

$$\lim_{h \to 0} ah[t/h] = at$$

The properties of independent and stationary increments clearly follow as
consequences of the iid steps in the random walk. Thus, B(t) becomes a Brownian
motion process as h → 0.
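
The limiting construction can be illustrated by simulation. The following is a minimal sketch, assuming a = 4, t = 1, and a step count chosen so that h = t/n_steps divides t exactly (which sidesteps the greatest-integer function); the function name, seed, and sample sizes are arbitrary choices for illustration. The sample mean and variance of the simulated positions should land near 0 and at, respectively, up to simulation error.

```python
import math
import random

def scaled_walk_position(a, t, n_steps, rng):
    """Position at time t of a walk taking n_steps jumps of size sqrt(a*h), h = t/n_steps."""
    h = t / n_steps
    dx = math.sqrt(a * h)                      # jump size: Delta x = sqrt(a h)
    s = sum(1 if rng.random() < 0.5 else -1 for _ in range(n_steps))
    return dx * s                              # B(t) = (Delta x) * S_n

rng = random.Random(12345)                     # fixed seed for reproducibility
a, t, n_steps, n_paths = 4.0, 1.0, 400, 2000
positions = [scaled_walk_position(a, t, n_steps, rng) for _ in range(n_paths)]

sample_mean = sum(positions) / n_paths
sample_var = sum((x - sample_mean) ** 2 for x in positions) / (n_paths - 1)
print(sample_mean, sample_var, a * t)
```

Increasing n_steps (that is, shrinking h) leaves the variance at a·t while making the individual jumps smaller, which is the sense in which the walk approaches Brownian motion.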


7.6.3 Further Properties of Brownian Motion 


Consider some fixed value x_0 > 0 and a fixed time t_0. The maximum value of B(t)
during the time interval 0 ≤ t ≤ t_0 is a random variable M. What is the probability that
M exceeds the threshold x_0? Figure 7.21 shows two sample paths, one for which the
level x_0 is exceeded prior to t_0 and one for which this does not occur.

Let’s focus on a path b(t) that does reach x_0 during the specified time interval.
Corresponding to this path, we now create a new path by reflecting about the line
y = x_0 the part of b(t) that lies to the right of the first time it reaches x_0. Denote the
first time at which the original path reaches x_0 by T. Then for t ≤ T the new path,
call it b*(t), is identical to the original path. But for any time t > T, it’s easy to
show that the reflected path is given by b*(t) = 2x_0 − b(t). The original path and its
reflected path are illustrated in Fig. 7.22.

Notice that the maximum level for each of these paths exceeds x_0, and exactly
one of these two paths has level exceeding x_0 at time t_0. That is, for every sample
path that exceeds level x_0 at time t_0, there are two sample paths whose maxima on
[0, t_0] exceed x_0, the original path and the reflected path. Put another way, for each
pair of sample paths whose level exceeds x_0 some time before t_0, one being the
reflection of the other about x_0 subsequent to its first “hitting time,” there is exactly
one sample path satisfying B(t_0) > x_0.

Fig. 7.21 A sample path for which M > x_0 and a path for which M < x_0

Fig. 7.22 A sample path crossing x_0 before time t_0 and its paired reflected path

Now given that B(T) = x_0, consider the pdf of the level B(t) at some time
subsequent to T. Because Brownian motion has independent increments, the
process “begins anew” at time T, except that its Gaussian behavior starts at x_0 rather
than at 0. The symmetry of the normal distribution implies that the pdf at a level
above x_0 at the future time is the same as the pdf at a level that is below x_0 by the
same amount. That is, the original path and the reflected path are equally likely.
Melding this equally likely property with the result of the previous paragraph,
establishing a one-to-one correspondence between pairs of reflected paths crossing
x_0 and paths whose level exceeds x_0 at time t_0, gives the following result.


PROPOSITION 
Let B(t) be Brownian motion and, for t_0 > 0, let M = max B(t) over 0 ≤ t ≤ t_0. Then

$$P(M > x_0) = 2P(B(t_0) > x_0) = 2\left[1 - \Phi\left(\frac{x_0}{\sqrt{at_0}}\right)\right] \qquad (7.8)$$


The second equality in the proposition comes from the fact that
B(t_0) ~ N(0, √(at_0)). Replacing x_0 with m on both sides of Eq. (7.8) and
differentiating with respect to m, we can determine the pdf of this rv:

$$f_M(m) = \frac{2}{\sqrt{2\pi at_0}}\,e^{-m^2/(2at_0)}, \qquad m > 0$$

The foregoing proposition also allows us to obtain the distribution of the random
variable T = the first time at which the process hits level x_0. To see this, note that a
sample path will first hit level x_0 before or at time t_0 iff the maximum level of the
path during the time interval from 0 to t_0 is at least x_0. In symbols, T ≤ t_0 iff M ≥ x_0.
Since the probability of the latter event is what appears in the last proposition box,
we immediately have the following result.


PROPOSITION
Let T be the first time that a Brownian motion process reaches level x_0. Then

$$F_T(t) = P(T \le t) = 2\left[1 - \Phi\left(\frac{x_0}{\sqrt{at}}\right)\right],$$

from which it follows that

$$f_T(t) = \frac{x_0}{\sqrt{2\pi a}}\,t^{-3/2}\,e^{-x_0^2/(2at)}, \qquad t > 0$$

Fig. 7.23 pdf of the hitting time, T, for Brownian motion


Figure 7.23 shows the probability distribution of the “hitting time” T. Exercise 
84 asks you to show this is a valid probability distribution and to derive the pdf from 
the cdf. 
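
As a numerical sanity check on the two propositions above, the sketch below evaluates Eq. (7.8) and compares it with the area under the hitting-time pdf from 0 to t_0. It is an illustration under assumed values (x_0 = 1, t_0 = 1, a = 1), not a proof; the function names and the midpoint-rule integrator are choices made here, not part of the text.

```python
import math

def std_normal_cdf(z):
    """Standard normal cdf, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_max_exceeds(x0, t0, a):
    """P(M > x0) from Eq. (7.8); the same expression gives P(T <= t0)."""
    return 2.0 * (1.0 - std_normal_cdf(x0 / math.sqrt(a * t0)))

def hitting_time_pdf(t, x0, a):
    """f_T(t) = x0 / sqrt(2 pi a) * t^(-3/2) * exp(-x0^2 / (2 a t)), for t > 0."""
    return x0 / math.sqrt(2.0 * math.pi * a) * t ** -1.5 * math.exp(-x0 ** 2 / (2.0 * a * t))

def integrate_midpoint(f, lo, hi, n=20000):
    """Midpoint-rule approximation to the integral of f over [lo, hi]."""
    width = (hi - lo) / n
    return width * sum(f(lo + (i + 0.5) * width) for i in range(n))

x0, t0, a = 1.0, 1.0, 1.0          # illustrative values only
cdf_value = prob_max_exceeds(x0, t0, a)
pdf_area = integrate_midpoint(lambda t: hitting_time_pdf(t, x0, a), 0.0, t0)
print(cdf_value, pdf_area)
```

The midpoint rule never evaluates the pdf at t = 0, where the formula is undefined; the pdf tends to 0 there, so nothing is lost.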


7.6.4 Variations on Brownian Motion 


Let B(t) denote a standard Brownian motion process (i.e., with a = 1), and let μ and
σ > 0 be constants. Brownian motion with drift is the process X(t) = μt + σB(t).
X(t) also has stationary and independent increments; an increment X(t + τ) − X(t) is
normally distributed with mean μτ and variance σ²τ. Brownian motion with drift
has many interesting applications and properties. For example, suppose the drift
parameter μ is negative. Then over time X(t) will tend toward ever lower values. It
can be shown that M, the maximum level attained over all time t ≥ 0, has an
exponential distribution with parameter 2|μ|/σ².

Standard Brownian motion and Brownian motion with drift both allow for
positive and negative values of the process. Thus they are not typically acceptable
models for the behavior of the price of some asset over time. A stochastic process
Z(t) is called geometric Brownian motion with drift parameter a if X(t) =
ln[Z(t)] is a Brownian motion process with drift having mean parameter
μ = a − σ²/2 and standard deviation parameter σ. Since Z(t) = exp(X(t)), a geometric
Brownian motion process will be nonnegative. Any particular sample path will
show random fluctuations about a long-term exponential decay or growth curve.

Geometric Brownian motion is a popular model for the pricing of assets. Let X(t)
be the price of an asset at time t. The ratio X(t)/X(0) is the proportion by which the
price has increased or decreased between time 0 and time t. In the same way that we
obtained Brownian motion as a limit of a simple symmetric random walk, geometric
Brownian motion can be obtained as a limit in which the price at each time point
either increases by a multiplicative factor u or goes down by another particular
multiplicative factor d. The limit is taken as the number of price changes during
(0, t] gets arbitrarily large, while the factors u and d get closer and closer to 1 and
the two probabilities associated with u and d approach .5. Geometric Brownian
motion is the basis for the famous Black-Scholes option pricing formula that is used
extensively in quantitative finance. This formula specifies a fair price for a contract
allowing an investor to purchase an asset at some future time point for a particular
price (e.g., a contract permitting an investor to purchase 100 shares of Facebook
stock at a price of $20 per share 3 months from now).
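
To make the definition of geometric Brownian motion concrete, here is a minimal simulation sketch. It builds X(t) = μt + σB(t) from independent normal increments (mean μ·dt, variance σ²·dt over each small step) and exponentiates. The parameter values, the seed, the step count, and the function name `simulate_gbm` are all assumptions chosen for illustration; this is not a pricing routine.

```python
import math
import random

def simulate_gbm(a, sigma, t_end, n_steps, rng):
    """One path of Z(t) = exp(X(t)), where X(t) = mu*t + sigma*B(t) and mu = a - sigma^2/2."""
    mu = a - sigma ** 2 / 2.0
    dt = t_end / n_steps
    x = 0.0                          # X(0) = 0, so Z(0) = 1
    path = [math.exp(x)]
    for _ in range(n_steps):
        # Increment of X over dt: normal with mean mu*dt and sd sigma*sqrt(dt)
        x += rng.gauss(mu * dt, sigma * math.sqrt(dt))
        path.append(math.exp(x))
    return path

rng = random.Random(2024)            # fixed seed for reproducibility
path = simulate_gbm(a=0.05, sigma=0.2, t_end=1.0, n_steps=250, rng=rng)
print(path[0], min(path), path[-1])
```

Every simulated value is positive because it is an exponential, which is the feature that makes the model usable for prices.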


7.6.5 Exercises: Section 7.6 (73-85) 


73. Let X(t) be a wide-sense stationary Gaussian process with mean μ_X = 13 and
autocovariance function C_XX(τ) = 9cos(τ/5).
(a) Calculate P(X(10) < 5).
(b) Calculate P(X(10) < X(8) + 2).

74. Let Y(t) be a WSS Gaussian process with mean μ_Y = −5 and autocorrelation
function R_YY(τ) = (25τ² + 34)/(1 + τ²). Determine each of the following:
(a) Var(Y(t))
(b) P(Y(2) > 5)
(c) P(|Y(2)| > 5)
(d) P(Y(6) − Y(2) > 5)

75. The voltage noise N(t) in a certain analog signal is modeled by a Gaussian
process with mean 0 V and autocorrelation function R_NN(t, s) = 1 − |t − s|/10
for |t − s| ≤ 10 (and zero otherwise).
(a) Is N(t) stationary? How can you tell?
(b) Determine P(|N(t)| > 1).
(c) Determine P(|N(t + 5) − N(t)| > 1).
(d) Determine P(|N(t + 15) − N(t)| > 1).
[Hint for (b)-(d): Does your answer depend on t?]

76. The text Gaussian Processes for Machine Learning (2nd ed., 2006) discusses
applications of the “regression-style” model Y(t) = β_0 + β_1 t + X(t) + ε, where
β_0 and β_1 are constants, the “error term” ε is a N(0, σ) rv, and X(t) is a Gaussian
random process with mean 0 and covariance function


Cxx(t, 5) = etsy 


for suitable choices of the parameters x > 0 and J > 0. X(f) and € are assumed to 

be independent. 

(a) Is X(t) wide-sense stationary? 

(b) Is Y(t) a Gaussian process?

(c) Determine the mean, variance, and autocovariance functions of Y(t). Is Y(t) 
WSS? 

(d) What effect does the parameter x have on Y(t)? That is, how would the 
behavior of Y(t) be different for large x versus small x? 

(e) What effect does the parameter / have on Y(t)? 
77. Consider the following model for the temperature X(t), in °F, measured t hours
after midnight on August 1, in Bakersfield, CA:

$$X(t) = 80 + 20\cos\left(\frac{\pi}{12}(t - 15)\right) + B(t),$$

where B(t) is a Brownian motion process with parameter a = .2.
(a) Determine the mean and variance functions of X(t). Interpret these
functions in the context of the example.
(b) According to this model, what is the probability that the temperature at
3 p.m. on August 1 will exceed 102 °F?
(c) Repeat part (b) for 3 p.m. on August 5.
(d) What is the probability that the temperatures at 3 p.m. on August 1 and
August 5 will be within 1 °F of each other?

78. Brownian motion is sometimes used in finance to model short-term asset price
fluctuation. Suppose the price (in dollars) of a barrel of crude oil varies
according to a Brownian motion process; specifically, suppose the change in
a barrel’s price t days from now is modeled by Brownian motion B(t) with
a = .15.
(a) Find the probability that the price of a barrel of crude oil has changed by
more than $1, in either direction, after 5 days.
(b) Repeat (a) for a time interval of 10 days.
(c) Given that the price has increased by $1 in seven days, what is the
probability the price will be another dollar higher after an additional
seven days?

Refer to the weather model in Exercise 77. Suppose a meteorologist uses the 

mean function of X(t) as her weather forecast over the next week. 

(a) Over the next five days, what is the probability that the actual temperature 
in Bakersfield will exceed the meteorologist’s prediction by more than 
5 °F? [Hint: What part of X(t) represents her prediction error?] 

(b) What is the probability that the actual temperature will exceed her predic- 
tion by 5 °F for the first time by midnight on August 3 (i.e., two days after 
t=0)? 

Refer to Exercise 78, and suppose the initial price of crude oil is $110 per 

barrel. 

(a) Over the next 30 days, what is the probability the maximum price of a 
barrel of crude oil will exceed $115? 

(b) Determine the probability that the price of crude oil will hit $120 for the 
first time within the next 60 days. 

The motion of a particle in two dimensions (e.g., a dust particle on a liquid 
surface) can be modeled by using Brownian motion in each direction, horizon- 
tal and vertical. That is, if (X(#), Y(t)) denotes the position of a particle at time ¢, 
starting at (0, 0), we assume X(t) and Y(f) are independent Brownian motion 
processes with common parameter a. This is sometimes called two-dimen- 
sional Brownian motion. 
(a) Suppose a certain particle moves in accordance with two-dimensional
Brownian motion with parameter $\alpha = 5$. Find the probability the particle
is more than 3 units away from (0, 0) in each dimension at time $t = 2$.

(b) For the particle in (a), find the probability the particle is more than 3 units
away from (0, 0) radially (i.e., by Euclidean distance) at time $t = 2$. [Hint:
It can be shown that the sum of squares of two independent $N(0, \sigma)$ rvs has
an exponential distribution with parameter $\lambda = 1/(2\sigma^2)$.]

(c) Three-dimensional Brownian motion, a model for particulate movement in
space, assumes that each location coordinate $(X(t), Y(t), Z(t))$ is an independent
Brownian motion process with common parameter $\alpha$. Suppose a
certain particle's motion follows three-dimensional Brownian motion
with parameter $\alpha = 0.2$. Find the probability that (1) the particle is more
than 1 unit away from (0, 0, 0) in each dimension at time $t = 4$, and (2) the
particle is more than 1 unit away from (0, 0, 0) radially at time $t = 4$. [Hint:
The sum of squares of three independent $N(0, \sigma)$ rvs has a gamma distribution
with $\alpha = 3/2$ and $\beta = 2\sigma^2$.]

82. Some forms of thermal voltage noise can be modeled by an Ornstein-Uhlenbeck
process $X(t)$, which is the solution to the "stochastic differential
equation" $X'(t) + \kappa X(t) = \sigma B'(t)$, where $B(t)$ is standard Brownian motion and
$\kappa, \sigma > 0$ are constants. With the initial condition $X(0) = 0$, it can be shown that
$X(t)$ is a Gaussian process with mean 0 and autocovariance function

$$C_{XX}(t, s) = \frac{\sigma^2}{2\kappa}\left[e^{-\kappa|s - t|} - e^{-\kappa(s + t)}\right]$$

(a) Is the Ornstein-Uhlenbeck process wide-sense stationary? Why or why not?

(b) Find the variance of $X(t)$. What happens to the variance of $X(t)$ as $t \to \infty$?

(c) Let $s = t + \tau$. What happens to $C_{XX}(t, t + \tau)$ as $t \to \infty$?

(d) For $s > t$, determine the conditional distribution of $X(s)$ given $X(t)$.

83. A Gaussian white noise process is a Gaussian process $N(t)$ with mean $\mu_N(t) = 0$ and
autocorrelation function $R_{NN}(\tau) = (N_0/2)\delta(\tau)$, where $N_0 > 0$ is a constant and
$\delta(\tau)$ is the Dirac delta function (see Appendix B).

(a) Is Gaussian white noise a stationary process?

(b) Define a new random process, $X(t)$, as the integrated version of $N(t)$:

$$X(t) = \int_0^t N(s)\,ds$$

Find the mean and autocorrelation functions of $X(t)$. Is $X(t)$ stationary?

84. Consider the "hitting-time" distribution $F_T(t) = 2[1 - \Phi(x_0/\sqrt{\alpha t})]$, $t > 0$, for
Brownian motion presented in the last proposition of this section.

(a) Show that $F_T(t)$ is a valid cdf for a nonnegative rv by proving that
(1) $F_T(t) \to 0$ as $t \to 0^+$, (2) $F_T(t) \to 1$ as $t \to \infty$, and (3) $F_T(t)$ is an
increasing function of $t$.

(b) Find the median of this hitting-time distribution.

(c) Derive the pdf of $T$ from the cdf.

(d) Does the mean of this hitting-time distribution exist?

85. Let $X(t)$ be a random process with stationary and independent increments and
$X(0)$ a constant.

(a) Take the variance of both sides of the expression

$$X(t + \tau) - X(0) = [X(t + \tau) - X(t)] + [X(t) - X(0)]$$

and use the properties of $X(t)$ to show that $\mathrm{Var}(X(t + \tau)) = \mathrm{Var}(X(\tau)) + \mathrm{Var}(X(t))$.

(b) The only solution to the functional relation $g(t + \tau) = g(t) + g(\tau)$ is a linear
function: $g(t) = at$ for some constant $a$. Apply this fact to part (a) to
conclude that any random process with a constant initial value and stationary
and independent increments must have linear variance. (This includes
both Brownian motion and the Poisson counting process of the
previous section.)


7.7 Continuous-Time Markov Processes 


Recall from Chap. 6 that a discrete-time Markov chain is a sequence of random
variables $X_0, X_1, X_2, \ldots$ satisfying the Markov property on some state space
(typically a set of integers). In this section, we continue to assume that the state
space consists of either a finite or infinite set of integers, but now the index set of
possible subscripts consists of all $t$ for which $t \ge 0$. Thus we have not only $X_5$ = the
state of the process at time $t = 5$ and $X_{12}$ = the state of the process at time 12, but
also $X_{6.5}$, the state of the process at time $t = 6.5$, $X_{27.249}$, and so on. For example, $X_t$
might be the number of customers in a service facility of some sort, where $t = 0$ is
the time at which the facility opened; we then track the number of customers in
continuous time rather than just at times 0, 1, 2, and so on. To be consistent with the
notation of Chap. 6, we will write the random process as $X_t$, with time as a subscript,
but we could just as well use the $X(t)$ notation from earlier sections of this chapter.

As before, the Markov property says that once we know the state of the process
at some time $t$, the probability distribution of future states does not depend on the
state of the process at any time prior to $t$. That is, given that $X_t = i$, the values of $X_u$
for $u > t$ do not depend on the values of $X_u$ for $u < t$. In particular, whenever we
have times $t_1 < t_2 < \cdots < t_n < t < u$,

$$P(X_u = j \mid X_t = i, X_{t_n} = i_n, \ldots, X_{t_1} = i_1) = P(X_u = j \mid X_t = i)$$


In general, the transition probabilities $P(X_u = j \mid X_t = i)$ might depend not only
on the time increment $u - t$ but also on $t$ itself. For example, it might be the case that
$P(X_{10} = j \mid X_{7.5} = i)$ is different from $P(X_{5.5} = j \mid X_3 = i)$ even though the increment is
2.5 time units in both expressions. As in the discrete-time case, we will assume
throughout this section that our Markov processes are time homogeneous, i.e., for
any time increment $h > 0$, the transition probability $P(X_{t+h} = j \mid X_t = i)$ depends on
$h$ but not on $t$, so that we may write

$$P_{ij}(h) = P(X_{t+h} = j \mid X_t = i)$$

Thus $P_{ij}(h)$ is the conditional probability that the state of the process $h$ time units
into the future will be $j$ given that the process is presently in state $i$.

Paralleling the discrete-time case, here we also have the Chapman-Kolmogorov
equations; these describe how $P_{ij}(t + h) = P(X_{t+h} = j \mid X_0 = i)$ is obtained by
conditioning on the state of the process after $t$ time units have elapsed:

$$\begin{aligned}
P_{ij}(t + h) &= P(\text{process is in state } j \text{ after } t + h \text{ time units} \mid \text{now in state } i) \\
&= \sum_k P(\text{process is in state } k \text{ after } t \text{ time units and in state } j \text{ after } h \text{ additional units} \mid \text{now in state } i) \\
&= \sum_k P(X_{t+h} = j \mid X_t = k, X_0 = i) \cdot P(X_t = k \mid X_0 = i) \\
&= \sum_k P(X_{t+h} = j \mid X_t = k) \cdot P(X_t = k \mid X_0 = i) \\
&= \sum_k P_{ik}(t) \cdot P_{kj}(h)
\end{aligned}$$

The second-to-last equality is by virtue of the Markov property: conditional on
the process being in state $k$ at time $t$, the state at any previous time is irrelevant to the
chance of being in state $j$ at the future time.
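As a quick numerical illustration of the Chapman-Kolmogorov equations, the following sketch uses the Poisson counting process of Sect. 7.5, for which $P_{ij}(t)$ is the Poisson probability of $j - i$ events in time $t$. The function names and the values of `lam`, `t`, and `h` are illustrative choices, not taken from the text.

```python
import math

def poisson_transition(i: int, j: int, t: float, lam: float) -> float:
    """P_ij(t) for a Poisson counting process: j - i events occur in time t."""
    n = j - i
    if n < 0:
        return 0.0  # the count can never decrease
    return math.exp(-lam * t) * (lam * t) ** n / math.factorial(n)

def chapman_kolmogorov_sum(i: int, j: int, t: float, h: float, lam: float) -> float:
    """Sum over intermediate states k of P_ik(t) * P_kj(h)."""
    # Only k = i, ..., j contribute, since the process cannot move downward.
    return sum(
        poisson_transition(i, k, t, lam) * poisson_transition(k, j, h, lam)
        for k in range(i, j + 1)
    )

lam, t, h = 2.0, 1.5, 0.7  # illustrative values, not from the text
lhs = poisson_transition(0, 4, t + h, lam)
rhs = chapman_kolmogorov_sum(0, 4, t, h, lam)
print(f"P_04(t+h) = {lhs:.6f}, sum over k = {rhs:.6f}")
```

The two printed values should agree up to floating-point rounding, since conditioning on the state at time $t$ simply splits the four events between the two subintervals.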


7.7.1 Transition Rates and Sojourn Times 


It is reasonable to assume that each transition probability $P_{ij}(t)$ is a continuous
function of $t$, so that such probabilities change smoothly as $t$ does. Since $P_{ii}(0) = 1$
and $P_{ij}(0) = 0$ for $i \ne j$, this implies that as $h$ approaches 0, $P_{ii}(h)$ approaches 1 and
$P_{ij}(h)$ approaches 0 when $i \ne j$. Rather amazingly, it turns out that all $P_{ij}(t)$ are
differentiable, and in particular are differentiable at 0:

$$P'_{ii}(0) = \lim_{h \to 0} \frac{P_{ii}(h) - P_{ii}(0)}{h} = \lim_{h \to 0} \frac{P_{ii}(h) - 1}{h} = -q_i$$

(it is convenient to denote this derivative by $-q_i$ because the numerator $P_{ii}(h) - 1$ is
negative and so the limit itself will be negative; $q_i$ is then positive), and

$$P'_{ij}(0) = \lim_{h \to 0} \frac{P_{ij}(h) - P_{ij}(0)}{h} = \lim_{h \to 0} \frac{P_{ij}(h)}{h} = q_{ij} \quad \text{for } i \ne j$$


In our development of the Poisson process in Sect. 7.5, we employed $o(h)$ notation
to represent a quantity that for small $h$ is negligible compared to $h$ (see also
Appendix B). Using this notation in combination with the preceding derivative
expressions, we have that for $h$ close to 0,

$$P(X_{t+h} = i \mid X_t = i) = P_{ii}(h) = 1 - q_i h + o(h)$$

$$P(X_{t+h} = j \mid X_t = i) = P_{ij}(h) = q_{ij} h + o(h) \qquad i \ne j$$

A continuous-time Markov process is characterized by these various transition
probability derivatives at $t = 0$, which are collectively called the infinitesimal
parameters of the process.

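Because a process that's in state $i$ at time $t$ must be somewhere at time $t + h$,
$\sum_j P_{ij}(h) = 1$ for any $i$. Taking the derivative of both sides and evaluating at 0 gives
a simple relationship between $q_i$ and the various $q_{ij}$s for $i \ne j$:

$$\begin{aligned}
1 &= \sum_j P_{ij}(h) = P_{ii}(h) + \sum_{j \ne i} P_{ij}(h) \;\Rightarrow \\
0 &= P'_{ii}(h) + \sum_{j \ne i} P'_{ij}(h) \;\Rightarrow \\
0 &= P'_{ii}(0) + \sum_{j \ne i} P'_{ij}(0) = -q_i + \sum_{j \ne i} q_{ij} \;\Rightarrow
\end{aligned}$$

$$q_i = \sum_{j \ne i} q_{ij} \qquad (7.9)$$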


Example 7.32 Our old friend the Poisson process with rate parameter $\lambda$ is an
especially simple case of a continuous-time Markov process. After remaining in
state $i$ for a time that is exponentially distributed with parameter $\lambda$, the only possible
transition is to state $i + 1$ when the next event occurs. So, $P_{ii}(h) = P(\text{no events in an
interval of length } h) = e^{-\lambda h}$, whence $P'_{ii}(h) = -\lambda e^{-\lambda h}$ and $q_i = -P'_{ii}(0) = \lambda$. Similarly,
$P_{i,i+1}(h) = 1 - P(\text{no events in interval}) + o(h) = 1 - e^{-\lambda h} + o(h)$, since the chance of two or
more events in the interval is $o(h)$; this implies $q_{i,i+1} = P'_{i,i+1}(0) = \lambda$, and
$q_{ij} = 0$ for $j \ne i + 1$. Notice that these values indeed satisfy Eq. (7.9). ■
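The limits that define $q_i$ and $q_{i,i+1}$ can be checked numerically. The sketch below evaluates the difference quotients for a Poisson process, using the exact one-event probability $\lambda h e^{-\lambda h}$ from Sect. 7.5 (which differs from $1 - e^{-\lambda h}$ only by an $o(h)$ term). The rate `lam = 3.0` and the function names are illustrative assumptions.

```python
import math

lam = 3.0  # illustrative rate, not from the text

def p_stay(h: float) -> float:
    """P_ii(h): no events in an interval of length h."""
    return math.exp(-lam * h)

def p_up(h: float) -> float:
    """P_{i,i+1}(h): exactly one event in an interval of length h."""
    return lam * h * math.exp(-lam * h)

for h in (0.1, 0.01, 0.001, 0.0001):
    q_i_est = (1 - p_stay(h)) / h  # should approach q_i = lam
    q_up_est = p_up(h) / h         # should approach q_{i,i+1} = lam
    print(f"h={h:<7} (1 - P_ii(h))/h = {q_i_est:.4f}   P_i,i+1(h)/h = {q_up_est:.4f}")
```

Both columns should settle toward $\lambda$ as $h$ shrinks, which is the sense in which the derivative at 0 is a transition rate.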


In the preceding example, it was determined that $q_{i,i+1} = P'_{i,i+1}(0) = \lambda$. In light of
the fact that $\lambda$ represents the rate at which events occur (i.e., the count increases by
1), this seems quite reasonable; after all, we know from calculus that a derivative
also represents a rate of change. In general, we can interpret the $q_{ij}$s in this fashion:
$q_{ij}$ represents the rate at which a Markov process transitions from state $i$ to state
$j$ over some very short time interval. Hence, the $q_{ij}$s are called the instantaneous
transition rates of the Markov process.


(Footnote) When the state space is infinite in extent, it can sometimes happen that $q_i = \infty$. This will not
occur for the situations considered herein.
Example 7.33 A nurse checks in on patients in three hospital rooms and also spends
time at the nurses' station. Identify these "states" as 0 = nurses' station and 1, 2,
3 = the three patient rooms. Let $X_t$ denote the nurse's location $t$ hours into her shift.
Figure 7.24 shows an example of the nurse's transitions between these four states
across continuous time. In this figure, she begins her shift at the nurses' station
($X_0 = 0$), spends some time there, and then moves periodically from room to room.


Fig. 7.24 One realization of the Markov process $X_t$ in Example 7.33 (vertical axis: state $X_t$, from 0 to 3; horizontal axis: time $t$)

Suppose the nurse walks from the nurses' station to room 3 an average of twice
per hour; this is the rate at which she transitions from state 0 to state 3. But the
foregoing discussion also indicated that the derivative $q_{ij}$ represents the rate of
change from state $i$ to state $j$. Therefore, we have that $q_{03} = 2$ (two such transitions
per hour). A complete description of the nurse's movements requires specifying all
of the other instantaneous transition rates as well; let's say those are

$$\begin{array}{lll}
q_{01} = 4 & q_{02} = .5 & q_{03} = 2 \\
q_{10} = 3 & q_{12} = 4 & q_{13} = 1 \\
q_{20} = 3 & q_{21} = .5 & q_{23} = 4 \\
q_{30} = 3 & q_{31} = 0 & q_{32} = 1
\end{array}$$

The remaining four infinitesimal parameters of this Markov process model can
be determined using Eq. (7.9). For example,

$$q_0 = \sum_{j \ne 0} q_{0j} = q_{01} + q_{02} + q_{03} = 4 + .5 + 2 = 6.5$$

Similarly, $q_1 = 8$, $q_2 = 7.5$, and $q_3 = 4$. ■
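The use of Eq. (7.9) here amounts to summing each row of the rate array. A minimal sketch, reading the array as $q_{02} = q_{21} = .5$ and $q_{31} = 0$ (the values consistent with the stated $q_0 = 6.5$, $q_1 = 8$, $q_2 = 7.5$, $q_3 = 4$); the dictionary layout and variable names are illustrative:

```python
# Instantaneous transition rates q_ij from Example 7.33 (per hour).
# State 0 = nurses' station; states 1, 2, 3 = patient rooms.
rates = {
    0: {1: 4.0, 2: 0.5, 3: 2.0},
    1: {0: 3.0, 2: 4.0, 3: 1.0},
    2: {0: 3.0, 1: 0.5, 3: 4.0},
    3: {0: 3.0, 1: 0.0, 2: 1.0},
}

# Eq. (7.9): q_i is the sum of the rates out of state i.
sojourn_params = {i: sum(out.values()) for i, out in rates.items()}
print(sojourn_params)
```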


The parameters $q_0, q_1, \ldots$ are not transition rates like the $q_{ij}$s, since they are
associated with time intervals in which the process stays in the same state. (Unlike
in the case of discrete-time Markov chains from Chap. 6, here we do not speak of
"transitions" from a state into itself.) Rather, an interpretation of the $q_i$s will be
provided by the main theorem of this section, coming up shortly.

7.7.2 Sojourn Times and Transitions


The time durations spent in various states are important features of a Markov 
process. A continuous time interval spent in one state is called a sojourn time of 
the process. In Fig. 7.24, five sojourn times are visible: the nurse spends a while at 
her station (state 0), then some time in patient room 1, then back at her station, over 
into room 3, and finally into room 2. These sojourn times are clearly continuous 
random variables, but what are their distributions? 

Think back to Example 7.32. From the results of Sect. 7.5, the distribution of
time that a Poisson process spends in state $i$ before moving to state $i + 1$ (that is, the
sojourn time between the occurrence of the $i$th event and the $(i + 1)$st event) is
exponential with parameter $\lambda$. Our next theorem says that sojourn times in any
continuous-time Markov process are also exponentially distributed, and specifies
the probabilities of moving to various other states once a state transition occurs.


THEOREM 


1. A sojourn time for state $i$ of a continuous-time Markov chain has an
exponential distribution, with parameter $\lambda = q_i$.

2. Once a sojourn in state $i$ has ended, the process next moves to a particular
state $j \ne i$ with probability $q_{ij}/q_i$.


Proof Let's first consider the distribution of $T$ = sojourn time in state $i$, and in
particular the probability that this sojourn time is at least $t + h$, where $h$ is very small.
In order for this event to occur, the process must remain in state $i$ continuously
throughout the time period of length $t$ and then continue in this state for an
additional $h$ time units. That is, with $F$ denoting the cumulative distribution
function of $T$ and $G(t) = 1 - F(t)$,

$$\begin{aligned}
G(t + h) &= 1 - F(t + h) = P(T > t + h) = P(X_u = i \text{ for } 0 \le u \le t \text{ and for } t \le u \le t + h) \\
&= P(X_u = i \text{ for } t \le u \le t + h \mid X_u = i \text{ for } 0 \le u \le t) \cdot P(X_u = i \text{ for } 0 \le u \le t) \\
&= P(X_u = i \text{ for } t \le u \le t + h \mid X_t = i) \cdot P(X_u = i \text{ for } 0 \le u \le t) \quad \text{by the Markov property} \\
&= P(X_u = i \text{ for } t \le u \le t + h \mid X_t = i) \cdot P(T > t) \\
&= P(X_u = i \text{ for } t \le u \le t + h \mid X_t = i) \cdot G(t)
\end{aligned}$$

Now, the probability $P(X_u = i \text{ for } t \le u \le t + h \mid X_t = i)$ is not quite $P_{ii}(h)$, because
the latter includes both the chance of remaining in state $i$ throughout $[t, t + h]$ and
also of making multiple transitions that bring the process back to state $i$ by the end
of this time interval. But because $h$ is small, the probability of two or more
transitions is negligible compared to the likelihood of either making a single
transition (to some other state $j$) or remaining in state $i$. That is, these two
probabilities differ by a term that is $o(h)$. Therefore we have
$$G(t + h) = [P_{ii}(h) + o(h)] \cdot G(t) = [1 - q_i h] \cdot G(t) + o(h) \;\Rightarrow\; \frac{G(t + h) - G(t)}{h} = -q_i G(t) + \frac{o(h)}{h}$$

Taking the limit as $h \to 0$ results in the differential equation $G'(t) = -q_i G(t)$, whose
solution (with $G(0) = 1$) is $G(t) = e^{-q_i t}$. Therefore, the cdf of $T$ is $F(t) = 1 - e^{-q_i t}$, an exponential
cdf with parameter $q_i$. This proves the first part of the theorem.

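For the second part, we consider the probability that the process is in state $j$ after
a short interval of time, given that it is in state $i$ at the beginning of that interval and
is not in state $i$ at the end of the interval:

$$\begin{aligned}
P(X_{t+h} = j \mid X_t = i, X_{t+h} \ne i) &= \frac{P(X_{t+h} = j, X_t = i, X_{t+h} \ne i)}{P(X_t = i, X_{t+h} \ne i)} \\
&= \frac{P(X_t = i, X_{t+h} = j)}{P(X_t = i, X_{t+h} \ne i)} \quad \text{because } \{X_{t+h} = j, X_{t+h} \ne i\} = \{X_{t+h} = j\}
\end{aligned}$$

Dividing numerator and denominator by $P(X_t = i)$ shows that this ratio equals
$P_{ij}(h)/[1 - P_{ii}(h)]$. If we now divide both numerator and denominator by $h$ and take the limit as
$h$ approaches 0, the result is $P(\text{next in } j \mid \text{currently in } i) = q_{ij}/q_i$, as asserted. ■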


Example 7.34 (Example 7.33 continued) According to the preceding theorem, the
time intervals spent by the nurse at the nurses' station are exponentially distributed
with parameter $\lambda = q_0 = 6.5$. Hence, the average length of time she spends there is
$1/\lambda = 1/6.5$ h = 9.23 min. Similarly, the average sojourn time in patient room 3 is
$1/q_3 = 1/4$ h = 15 min. When the nurse leaves her station, the likelihoods that she
next visits rooms 1, 2, and 3 are, respectively,

$$\frac{q_{01}}{q_0} = \frac{4}{6.5} = \frac{8}{13}, \qquad \frac{q_{02}}{q_0} = \frac{.5}{6.5} = \frac{1}{13}, \qquad \frac{q_{03}}{q_0} = \frac{2}{6.5} = \frac{4}{13}$$

Similarly, when the nurse leaves patient room 1, there is a 3/8 chance she'll
return to the nurses' station, a 4/8 probability of moving on to room 2, and a 1/8
chance of checking the patients in room 3.

Notice that we could also obtain these probabilities by rescaling the appropriate
row of the array in Example 7.33 so that the entries sum to 1. In general, the
transition probabilities exiting a sojourn spent in state $i$ are proportional to the
instantaneous transition rates out of state $i$. ■
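The theorem also suggests a direct way to simulate such a process: draw an exponential sojourn time with parameter $q_i$, then choose the next state $j$ with probability $q_{ij}/q_i$. The following sketch applies this to the nurse example, reading the rate array as $q_{02} = q_{21} = .5$ and $q_{31} = 0$. The function name `simulate`, the seed, and the number of jumps are illustrative assumptions.

```python
import random

# Rates q_ij (per hour) from Example 7.33; state 0 is the nurses' station.
# The zero rate q_31 is simply omitted.
RATES = {
    0: {1: 4.0, 2: 0.5, 3: 2.0},
    1: {0: 3.0, 2: 4.0, 3: 1.0},
    2: {0: 3.0, 1: 0.5, 3: 4.0},
    3: {0: 3.0, 2: 1.0},
}

def simulate(n_jumps: int, start: int = 0, seed: int = 1):
    """Simulate n_jumps sojourns; return per-state sojourn lists and jump counts."""
    rng = random.Random(seed)
    sojourns = {i: [] for i in RATES}
    jumps = {i: {j: 0 for j in RATES[i]} for i in RATES}
    state = start
    for _ in range(n_jumps):
        q_i = sum(RATES[state].values())
        # Part 1 of the theorem: the sojourn time is exponential with parameter q_i.
        sojourns[state].append(rng.expovariate(q_i))
        # Part 2: the next state is j with probability q_ij / q_i.
        targets = list(RATES[state])
        weights = [RATES[state][j] for j in targets]
        nxt = rng.choices(targets, weights=weights)[0]
        jumps[state][nxt] += 1
        state = nxt
    return sojourns, jumps

sojourns, jumps = simulate(200_000)
mean_station_min = 60 * sum(sojourns[0]) / len(sojourns[0])
exits = sum(jumps[0].values())
print(f"mean sojourn at station: {mean_station_min:.2f} min (theory 9.23)")
for j, count in jumps[0].items():
    print(f"station -> room {j}: {count / exits:.3f} (theory {RATES[0][j] / 6.5:.3f})")
```

With a run of this length, the simulated mean sojourn time and exit frequencies should land close to the theoretical values 9.23 min and 8/13, 1/13, 4/13, though the exact figures depend on the seed.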


Example 7.35 Consider a machine that goes back and forth between working
condition (state 0) and needing repair (state 1). Suppose that the duration of working
condition time is exponential with parameter $\alpha$ and the duration of repair time is
exponential with parameter $\beta$. From the preceding theorem, $q_0 = \alpha$ and $q_1 = \beta$.
Equation (7.9) then implies that $q_{01} = \alpha$ and $q_{10} = \beta$, from which we infer that

$$P_{01}(h) = \alpha h + o(h) \quad \text{and} \quad P_{10}(h) = \beta h + o(h)$$

That is, for very small values of $h$, the chance of transitioning from working
condition to needing repair in the next $h$ time units is roughly $\alpha h$, while $P(\text{working
at time } t + h \mid \text{being repaired at time } t) \approx \beta h$ for $h$ small.

Notice also that once a sojourn in the working state ($i = 0$) has ended, the process
moves to the repair state ($j = 1$) with probability $q_{01}/q_0 = \alpha/\alpha = 1$. This makes
sense, since a machine leaving the working condition has nowhere else to go except
into repair. The same is true if the roles of $i$ and $j$ are reversed. ■


Example 7.36 A commercial printer has four machines of a certain type. Because
there are only three employees trained to operate this kind of machine, at most three of
the four can be in operation at any given time. Once a machine starts to operate, the
time until it fails is exponentially distributed with parameter $\alpha$ (so the mean time
until failure is $1/\alpha$). There are unfortunately only two employees who can repair
these machines, each of whom works on just one machine at a time. So if three
machines need repair at any given time, only two of these will be undergoing repair,
and if all four machines need repair, two will be waiting to start the repair process.
Time necessary to repair a machine is exponentially distributed with parameter $\beta$
(thus mean time to repair is $1/\beta$).

Let $X_t$ be the number of machines that are in operating condition at time $t$ (whether or
not they are actually running). Possible values
of $X_t$ (i.e., the states of the process) are 0, 1, 2, 3, and 4. If the system is currently in
state 1, 2, 3, or 4, one possible state transition results from one of the working
machines suddenly breaking down. Alternatively, if the system is currently in state
0, 1, 2, or 3, the next transition might result from one of the machines in repair
finishing the repair process. These possible transitions are depicted in the state
diagram in Fig. 7.25.

Fig. 7.25 State diagram for Example 7.36

The eight non-zero instantaneous transition rates must be determined.
Two of them follow the derivation from the Poisson process in Example 7.32:

$q_{10} = \alpha$ [i.e., $P_{10}(h) = \alpha h + o(h)$, corresponding to the one working machine
going down]

$q_{34} = \beta$ [in state 3, only one machine is currently being repaired]

Next, consider the transition from 2 working machines to just 1. For a time
interval of length $h$,

$$\begin{aligned}
P_{21}(h) &= P(X_{t+h} = 1 \mid X_t = 2) = P(\text{1st working machine breaks down} \cup \text{2nd breaks down}) \\
&= P(\text{1st breaks}) + P(\text{2nd breaks}) - P(\text{both break}) \\
&= (1 - e^{-\alpha h}) + (1 - e^{-\alpha h}) - (1 - e^{-\alpha h})^2
\end{aligned}$$

The term $(1 - e^{-\alpha h})$ comes from the fact that for an exponentially distributed
rv $T$, $P(T \le t + h \mid T > t) = 1 - e^{-\alpha h}$. Differentiating and substituting $h = 0$ gives

$$q_{21} = P'_{21}(0) = \alpha + \alpha - 0 = 2\alpha$$


When exactly two machines are working, the instantaneous failure rate is twice
that of a single machine (because, in effect, twice as many things can go wrong).

By similar reasoning, $q_{32} = 3\alpha$, and also $q_{43} = 3\alpha$, because although four
machines are in operating condition, only three are actually operating. Finally,
$q_{01} = 2\beta$ (none of the machines are working, but only two are undergoing repair),
and likewise $q_{12} = q_{23} = 2\beta$.

From Eq. (7.9), the parameters of the exponential sojourn distributions are

$$q_0 = 2\beta \qquad q_1 = \alpha + 2\beta \qquad q_2 = 2\alpha + 2\beta \qquad q_3 = 3\alpha + \beta \qquad q_4 = 3\alpha$$

So, for example, the length of a time interval in which exactly three machines are
in operating condition has an exponential distribution with $\lambda = 3\alpha + \beta$, and the expected duration
of such an interval is $1/(3\alpha + \beta)$. ■


A continuous-time Markov chain for which the only possible transitions from
state $i$ are either to state $i - 1$ or state $i + 1$ is called a birth and death process. The
Poisson process is an example of a pure birth process—no deaths are allowed. In 
Example 7.36, a birth occurs when a machine finishes repair, and a death occurs 
when a machine breaks down. Thus, starting from state 0 only a birth is possible, 
starting from state 4 only a death is possible, and either a birth or a death is possible 
when starting from state 1, 2, or 3. 


7.7.3 Long-Run Behavior of Markov Processes 


Consider first a continuous-time Markov chain for which the state space is finite and
consists of the states 0, 1, 2, 3, ..., $N$. Then we already know that
$q_i = \sum_{j \ne i} q_{ij}$ for each $i = 0, 1, \ldots, N$.
Let's now create a matrix of these parameters (the exponential sojourn
parameters and the instantaneous transition rates) in which the diagonal elements
are the $-q_i$s and the off-diagonal elements are the $q_{ij}$s. Here is the matrix in the case
$N = 4$:

$$Q = \begin{bmatrix}
-q_0 & q_{01} & q_{02} & q_{03} & q_{04} \\
q_{10} & -q_1 & q_{12} & q_{13} & q_{14} \\
q_{20} & q_{21} & -q_2 & q_{23} & q_{24} \\
q_{30} & q_{31} & q_{32} & -q_3 & q_{34} \\
q_{40} & q_{41} & q_{42} & q_{43} & -q_4
\end{bmatrix}$$

Equation (7.9) implies that the sum of every row in this matrix of parameters is
zero (since each $q_i$ is the sum of the $q_{ij}$s in its row). This matrix $Q$ is
sometimes called the generator matrix of the Markov process.

Next, define a transition matrix $P(t)$ whose $(i, j)$th entry is the transition probability
$P_{ij}(t) = P(X_t = j \mid X_0 = i)$. Analogous to the discrete case, the Chapman-Kolmogorov
equations can be rendered in terms of the transition matrix: $P(t + h) = P(t)P(h)$.
We now consider the derivative of the transition matrix at time $t$:

$$P'(t) = \lim_{h \to 0} \frac{P(t + h) - P(t)}{h} = \lim_{h \to 0} \frac{P(t)P(h) - P(t)}{h} = P(t) \cdot \lim_{h \to 0} \left\{ \frac{P(h) - I}{h} \right\}$$

From the earlier derivatives, the limit of the matrix inside braces is precisely the
generator matrix $Q$. Thus we obtain the following system of so-called "forward"
differential equations in the transition probabilities:

$$P'(t) = P(t)Q \qquad (7.10)$$

where $P'(t)$ is the matrix of derivatives of the transition probabilities.

As in the discrete case, if the chain is irreducible, i.e., all states communicate
with one another, then $P_{ij}(t) > 0$ for every pair of states and $\lim_{t \to \infty} P_{ij}(t)$ exists and
equals a value $\pi_j$ independent of the initial state. Thus as $t \to \infty$, $P'(t)$ approaches a
matrix consisting entirely of 0s (because the probabilities themselves are
approaching constants independent of $t$) and $P(t)$ itself approaches a matrix each
of whose rows is $\pi = [\pi_0, \pi_1, \ldots, \pi_N]$. Applying these statements to Eq. (7.10) and
taking the top row (or any row) of each side, the vector of stationary probabilities
must then satisfy $0 = \pi Q$, as well as $\sum_j \pi_j = 1$. Slight rearrangement of the equations
gives

$$\begin{aligned}
\pi_0 q_0 &= \pi_1 q_{10} + \pi_2 q_{20} + \cdots + \pi_N q_{N0} \\
\pi_1 q_1 &= \pi_0 q_{01} + \pi_2 q_{21} + \cdots + \pi_N q_{N1} \\
&\;\;\vdots \\
\pi_N q_N &= \pi_0 q_{0N} + \pi_1 q_{1N} + \cdots + \pi_{N-1} q_{N-1,N} \\
1 &= \pi_0 + \pi_1 + \pi_2 + \cdots + \pi_N
\end{aligned}$$

Consider the first of these equations. The left hand side gives the long-run rate at
which the process leaves state 0, and the right hand side is the sum of rates at which
the process goes from some other state to state 0. So, the equation says that the long-run
rate out of that state equals the long-run rate into the state. The other equations
have analogous interpretations.


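Example 7.37 (Example 7.36 continued) The generator matrix for the printing
machines scenario is

$$Q = \begin{bmatrix}
-2\beta & 2\beta & 0 & 0 & 0 \\
\alpha & -(\alpha + 2\beta) & 2\beta & 0 & 0 \\
0 & 2\alpha & -(2\alpha + 2\beta) & 2\beta & 0 \\
0 & 0 & 3\alpha & -(3\alpha + \beta) & \beta \\
0 & 0 & 0 & 3\alpha & -3\alpha
\end{bmatrix}$$

The stationary distribution of the chain satisfies $0 = \pi Q$. Expanding these
matrices, the resulting system of equations is

$$\begin{aligned}
-2\beta\pi_0 + \alpha\pi_1 &= 0 \\
2\beta\pi_0 - (\alpha + 2\beta)\pi_1 + 2\alpha\pi_2 &= 0 \\
2\beta\pi_1 - (2\alpha + 2\beta)\pi_2 + 3\alpha\pi_3 &= 0 \\
2\beta\pi_2 - (3\alpha + \beta)\pi_3 + 3\alpha\pi_4 &= 0 \\
\beta\pi_3 - 3\alpha\pi_4 &= 0
\end{aligned}$$

The first equation immediately gives $\pi_1 = (2\beta/\alpha)\pi_0$. Then substituting this
expression for $\pi_1$ into the second equation and doing a bit of algebra results in
$\pi_2 = (2\beta^2/\alpha^2)\pi_0$. Now substitute this expression into the third equation and solve for
$\pi_3$ in terms of $\pi_0$, and then obtain an expression for $\pi_4$ in terms of $\pi_0$. The stationary
probabilities are now

$$\pi_1 = \frac{2\beta}{\alpha}\pi_0, \qquad \pi_2 = \frac{2\beta^2}{\alpha^2}\pi_0, \qquad \pi_3 = \frac{4\beta^3}{3\alpha^3}\pi_0, \qquad \pi_4 = \frac{4\beta^4}{9\alpha^4}\pi_0$$

Finally, the fact that the sum of all five $\pi$s equals 1 gives an expression for $\pi_0$:

$$\pi_0 = \frac{1}{1 + \dfrac{2\beta}{\alpha} + \dfrac{2\beta^2}{\alpha^2} + \dfrac{4\beta^3}{3\alpha^3} + \dfrac{4\beta^4}{9\alpha^4}}$$

Consider two different specific cases: (1) $\alpha = 1$, $\beta = 2$, (2) $\alpha = 2$, $\beta = 1$. In the
first case, machines get repaired more quickly than they fail, and in the second case
the opposite is true. Here are the stationary probabilities:

(1) $\pi_0 = .0325$, $\pi_1 = .1300$, $\pi_2 = .2600$, $\pi_3 = .3466$, $\pi_4 = .2310$
(2) $\pi_0 = .3711$, $\pi_1 = .3711$, $\pi_2 = .1856$, $\pi_3 = .0619$, $\pi_4 = .0103$

In the first case, the mean number of machines in operating condition is
$\sum i\pi_i = 2.614$, and in the second case it is only .969. ■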
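For readers who prefer to let software do the algebra, the system $0 = \pi Q$ together with $\sum \pi_i = 1$ can be solved numerically. The sketch below is one way to do so in plain Python, using Gaussian elimination on the generator matrix of this example. The function names `generator` and `stationary` are illustrative, and the elimination routine is a generic technique rather than anything prescribed by the text.

```python
def generator(alpha: float, beta: float) -> list[list[float]]:
    """Generator matrix Q of Example 7.37 (state = number of machines in working order)."""
    a, b = alpha, beta
    return [
        [-2 * b, 2 * b, 0.0, 0.0, 0.0],
        [a, -(a + 2 * b), 2 * b, 0.0, 0.0],
        [0.0, 2 * a, -(2 * a + 2 * b), 2 * b, 0.0],
        [0.0, 0.0, 3 * a, -(3 * a + b), b],
        [0.0, 0.0, 0.0, 3 * a, -3 * a],
    ]

def stationary(Q: list[list[float]]) -> list[float]:
    """Solve pi Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination with partial pivoting."""
    n = len(Q)
    # Row j of A is column j of Q (the j-th balance equation), augmented with 0.
    A = [[Q[i][j] for i in range(n)] + [0.0] for j in range(n)]
    # The last balance equation is redundant; replace it by the normalization.
    A[n - 1] = [1.0] * n + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

for alpha, beta in [(1.0, 2.0), (2.0, 1.0)]:
    pi = stationary(generator(alpha, beta))
    mean_working = sum(k * p for k, p in enumerate(pi))
    print(alpha, beta, [round(p, 4) for p in pi], round(mean_working, 3))
```

The printed probabilities should match the two cases listed in the example up to rounding in the last displayed digit.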


Under quite general conditions, the forward system of differential equations
(7.10) is valid for a birth and death process even when the state space is infinite (i.e.,
when there is no upper bound on the population size). Furthermore, the stationary
distribution exists and has a rather simple form. Let

$$\theta_0 = 1, \qquad \theta_1 = \frac{q_{01}}{q_{10}}, \qquad \theta_2 = \frac{q_{01}q_{12}}{q_{10}q_{21}}, \qquad \theta_3 = \frac{q_{01}q_{12}q_{23}}{q_{10}q_{21}q_{32}}, \quad \ldots \qquad (7.11)$$

Then $\pi_1 = \theta_1\pi_0$, $\pi_2 = \theta_2\pi_0$, $\pi_3 = \theta_3\pi_0$, ..., and $\pi_0 = 1/\sum_k \theta_k$, provided that the sum in
the denominator is finite.


Example 7.38 Customers arrive at a service facility according to a Poisson process
with rate parameter $\lambda$ (so the times between successive arrivals are independent and
exponentially distributed, each with parameter $\lambda$). The facility has only one server,
and the service time for any particular customer is exponentially distributed with
parameter $\mu$. This is often referred to as an M/M/1 queue, where M stands for
Markovian. Let $X_t$ represent the number of customers in the system at time $t$.

The mean time between successive arrivals is $1/\lambda$, and the mean time for a
service to be completed is $1/\mu$. Intuitively, if $1/\mu > 1/\lambda$ (i.e., if $\mu < \lambda$), then customers
will begin to pile up in the system and there won't be a limiting distribution because
the number of customers in the system will grow arbitrarily large over time. We
therefore restrict consideration to the case $\lambda < \mu$ (the case in which $\lambda = \mu$ is a bit
tricky).

The infinitesimal birth parameters are $q_{i,i+1} = \lambda$ for $i = 0, 1, 2, \ldots$, since a birth
occurs when a new customer enters the facility, and the infinitesimal death
parameters are $q_{i+1,i} = \mu$ for $i = 0, 1, 2, \ldots$, since a death occurs when a customer
finishes service. Substituting into Eq. (7.11),

$$\theta_k = \frac{\lambda^k}{\mu^k}, \quad k = 0, 1, 2, 3, \ldots, \qquad \pi_0 = \frac{1}{\sum_{k=0}^{\infty} (\lambda/\mu)^k} = 1 - \frac{\lambda}{\mu}$$

$$\pi_k = \pi_0\theta_k = \left(1 - \frac{\lambda}{\mu}\right)\left(\frac{\lambda}{\mu}\right)^k \quad k = 0, 1, 2, 3, \ldots$$

This is similar to a geometric distribution with $p = 1 - \lambda/\mu$, except that the terms
start at $k = 0$ rather than $k = 1$. Nevertheless, we can quickly determine that the
mean number of customers in the system is $\sum k\pi_k = (\lambda/\mu)/(1 - \lambda/\mu) = \lambda/(\mu - \lambda)$.
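The closed-form results for the M/M/1 queue can be checked against a direct evaluation of Eq. (7.11). The sketch below truncates the infinite sum after a fixed number of terms, which is an approximation (harmless when $\lambda/\mu$ is well below 1). The rates `lam = 2.0` and `mu = 3.0`, the truncation point, and the function name are illustrative assumptions.

```python
def mm1_stationary(lam: float, mu: float, n_terms: int = 200) -> list[float]:
    """Approximate pi_0, ..., pi_{n_terms-1} for an M/M/1 queue via Eq. (7.11)."""
    if not lam < mu:
        raise ValueError("a stationary distribution requires lam < mu")
    # theta_k = (q_01 q_12 ... q_{k-1,k}) / (q_10 q_21 ... q_{k,k-1}) = (lam/mu)^k
    thetas = [1.0]
    for _ in range(1, n_terms):
        thetas.append(thetas[-1] * lam / mu)
    total = sum(thetas)  # truncated version of the infinite sum
    return [th / total for th in thetas]

lam, mu = 2.0, 3.0  # illustrative rates, not from the text
pi = mm1_stationary(lam, mu)
mean_customers = sum(k * p for k, p in enumerate(pi))
print(f"pi_0 = {pi[0]:.4f}  (closed form {1 - lam / mu:.4f})")
print(f"mean = {mean_customers:.4f}  (closed form {lam / (mu - lam):.4f})")
```

As $\lambda$ approaches $\mu$ the geometric tail decays slowly, so a larger `n_terms` would be needed for the truncated sum to remain accurate.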


7.7.4 Explicit Form of the Transition Matrix 


The forward system of differential equations was obtained by decomposing the
time interval from 0 to $t + h$ into the interval from 0 to $t$ and the interval from $t$ to
$t + h$. A "backward" system of equations results from considering the two
intervals $[0, h]$ and $(h, t + h]$ and again using the Chapman-Kolmogorov
equations: $P(t + h) = P(h)P(t)$. The derivative of the transition matrix at time $t$ is
then

$$P'(t) = \lim_{h \to 0} \frac{P(t + h) - P(t)}{h} = \lim_{h \to 0} \frac{P(h)P(t) - P(t)}{h} = \left\{ \lim_{h \to 0} \frac{P(h) - I}{h} \right\} P(t)$$

The matrix limit is again $Q$, giving the following system of equations:

$$P'(t) = QP(t) \qquad (7.12)$$

Contrast the "backward" equation (7.12) with the "forward" equation (7.10): the
two matrices on the right-hand side are simply reversed. Of course, in general
matrices do not commute, so one equation does not follow from the other; that both
$QP(t)$ and $P(t)Q$ equal $P'(t)$ is a special property of Markov processes.

Now recall from calculus that a solution to the equation $f'(t) = cf(t)$ is $f(t) = e^{ct}$,
and also that the infinite series expansion for $e^c$ is $1 + \sum_{k=1}^{\infty} c^k/k!$. By analogy, the
solution to our system of backward equations (7.12) is

$$P(t) = e^{Qt} = I + \sum_{k=1}^{\infty} \frac{Q^k t^k}{k!}$$

Example 7.39 (Example 7.35 continued) Let's return to the scenario involving a
single machine which is either working or undergoing repair, where time until
failure has an exponential distribution with parameter $\alpha$ and repair time is
exponentially distributed with parameter $\beta$. The matrix of infinitesimal parameters is

$$Q = \begin{bmatrix} -\alpha & \alpha \\ \beta & -\beta \end{bmatrix}$$

It is easily verified that $Q^k = [-(\alpha + \beta)]^{k-1}Q$, from which

$$P(t) = I + \sum_{k=1}^{\infty} \frac{[-(\alpha + \beta)]^{k-1} t^k}{k!} Q = I + \frac{1 - e^{-(\alpha + \beta)t}}{\alpha + \beta} Q$$

We now have a completely explicit formula for the transition probabilities of the
Markov process for any time duration $t$. Notice that the sum of each row in the
transition matrix is 1, as required.

This explicit form of $P(t)$ also allows us to investigate the chain's long-run
behavior. Specifically, as $t \to \infty$,

$$P(t) \to I + \frac{1}{\alpha + \beta} Q = \begin{bmatrix} \beta/(\alpha + \beta) & \alpha/(\alpha + \beta) \\ \beta/(\alpha + \beta) & \alpha/(\alpha + \beta) \end{bmatrix}$$

Thus the stationary distribution is given by $\pi_0 = \beta/(\alpha + \beta)$ and $\pi_1 = \alpha/(\alpha + \beta)$,
which could also have been obtained by solving $\pi Q = 0$, $\pi_0 + \pi_1 = 1$. ■
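The series $P(t) = I + \sum_{k \ge 1} Q^k t^k/k!$ can also be summed numerically and compared with the closed form $P(t) = I + \{[1 - e^{-(\alpha + \beta)t}]/(\alpha + \beta)\}Q$, which follows from $Q^k = [-(\alpha + \beta)]^{k-1}Q$. The following sketch does this for the two-state machine chain in plain Python. The parameter values, the truncation at 60 terms, and the function names are illustrative assumptions.

```python
import math

def mat_mul(A, B):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transition_series(alpha: float, beta: float, t: float, n_terms: int = 60):
    """P(t) = I + sum_{k>=1} Q^k t^k / k!, truncated after n_terms terms."""
    Q = [[-alpha, alpha], [beta, -beta]]
    P = [[1.0, 0.0], [0.0, 1.0]]     # the k = 0 term (identity)
    term = [[1.0, 0.0], [0.0, 1.0]]
    for k in range(1, n_terms + 1):
        # term becomes Q^k t^k / k! by multiplying the previous term by Q t / k
        term = [[x * t / k for x in row] for row in mat_mul(term, Q)]
        P = [[P[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return P

def transition_closed_form(alpha: float, beta: float, t: float):
    """P(t) = I + (1 - e^{-(alpha+beta)t}) / (alpha+beta) * Q."""
    c = (1 - math.exp(-(alpha + beta) * t)) / (alpha + beta)
    return [[1 - c * alpha, c * alpha], [c * beta, 1 - c * beta]]

alpha, beta, t = 0.5, 2.0, 1.2  # illustrative values, not from the text
print(transition_series(alpha, beta, t))
print(transition_closed_form(alpha, beta, t))
```

For larger chains, where no simple pattern in $Q^k$ is available, the truncated series (or a library matrix exponential) is the practical route; the closed form here serves mainly as a check.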


7.7.5 Exercises: Section 7.7 (86-97) 


86. 


87. 


88. 


The authors of the article “A Multi-State Markov Model for a Short-Term 
Reliability Analysis of a Power Generating Unit” (Reliab. Eng. and Sys. Safety, 
2012: 1-6) modeled the transitions of a particular coal-fired generating unit 
through four states, characterized by the unit’s capacity: 0 = complete failure, 
(0 MW of power), 1 =247 MW, 2=482 MW, and 3 = 575 MW (full power). 
Observation of the unit over an extended period of time yielded the following 
instantaneous transition rates: 


doi = .0800 qdo2 = .0133 do3 = 0 
dio = .0294 gix= 3235 qy3=.0294 
q20=0 go = .0288 903 = 3558 
qx9=.0002 3 =.0001 gq = .0007 


(a) Determine the complete generator matrix Q of this Markov process. 

(b) Determine the stationary probabilities of this process. 

(c) What is the long-run expected output of this particular unit, in megawatts? 

87. Potential customers arrive at a service facility according to a Poisson process
with rate λ. However, an arrival will enter the facility only if there is no one
already being served, and otherwise will disappear (there is no waiting room!).
Once a customer enters the facility, service is carried out in two stages. The
time to complete the first stage of service is exponentially distributed with
parameter λ_1. A customer completing the first stage of service immediately
enters the second stage, where the distribution of time to complete service is
exponential with parameter λ_2.

(a) Define appropriate states, and then identify the q_i’s and q_ij’s.

(b) Determine the stationary probabilities when λ = 1, λ_1 = 3, λ_2 = 2.

(c) Determine the stationary probabilities when λ = 1, λ_1 = 2, λ_2 = 3.

(d) Determine the stationary probabilities when λ = 4, λ_1 = 2, λ_2 = 1.

88. Return to the scenario of the previous exercise, and now suppose that the
facility has a waiting area that will accommodate one customer. A customer
in the waiting area cannot begin the first stage of service until the previous
customer has completed both stages.

(a) Define appropriate states, and then identify the q_i’s and q_ij’s. [Hint: The chain
now has five possible states.]

(b) Determine the stationary probabilities when λ = 1, λ_1 = 3, λ_2 = 2.

(c) Determine the stationary probabilities when λ = 1, λ_1 = 2, λ_2 = 3.

(d) Determine the stationary probabilities when λ = 4, λ_1 = 2, λ_2 = 1.


89. Reconsider the scenario of Exercise 87. Now suppose that a customer who
finishes stage 2 service leaves the facility with probability .8, but with
probability .2 returns to stage 1 for rework because of deficient service and then
proceeds again to stage 2.

(a) Define appropriate states, and then identify the q_i’s and q_ij’s.

(b) Determine the stationary probabilities when λ = 1, λ_1 = 3, λ_2 = 2.

(c) Determine the stationary probabilities when λ = 1, λ_1 = 2, λ_2 = 3.

(d) Determine the stationary probabilities when λ = 4, λ_1 = 2, λ_2 = 1.

(e) What is the expected total time that a customer remains in the facility once
he/she has entered?

90. The Yule Process is a special case of a birth and death process in which only
births occur; each member of the population at time t has probability βh + o(h)
of giving birth to an additional member during a short time interval of length
h, independently of what happens to any other member of the population at that
time (so there is no interaction among population members). Let X_t denote the
population size at time t.

(a) Show that if the population size is currently n, then the probability of a birth
in the next interval of length h is nβh + o(h), and that the probability of no
births in the next interval of length h is 1 − nβh + o(h).

(b) Relate P_ij(t+h) to the transition probabilities at time t, and take an
appropriate limit to establish a differential equation for P_ij(t). [Hint: If X_{t+h} = j
and h is small, there are only two possible values for X_t. Your answer
should relate P_ij(t+h), P_ij(t), and P_{i,j−1}(t).]

(c) Assuming that there is one individual alive at time 0, show that a solution to
the differential equation in (b) is P_1n(t) = e^{−βt}(1 − e^{−βt})^{n−1}. (In fact, this is
the only solution satisfying the initial condition.)

(d) Determine the expected population size at time t, assuming X_0 = 1. [Hint:
What type of probability distribution is P_1n(t)?]

91. Another special case of a birth and death process involves a population
consisting of N individuals. At time t = 0 exactly one of these individuals is
infected with a particular disease, and the other N − 1 are candidates for
acquiring the disease (susceptibles). Once infected, an individual remains so
forever. In any short interval of time h, the probability that any particular
infected individual will transmit the disease to any particular non-diseased
individual is βh + o(h). Let X_t represent the number of infected individuals at
time t. Specify the birth parameters for this process. [Hint: Use the differential
equation from the last exercise.]

92. At time t = 0 there are N individuals in a population. Let X_t represent the
number of individuals alive at time t. A linear pure death process is one in
which the probability that any particular individual alive at time t dies in a short
interval of length h is βh + o(h); no births can occur, deaths occur
independently, and there is no immigration into the population.

(a) Obtain a differential equation for the transition probabilities of this process,
and then show that the solution is

\[
P_{Nn}(t) = \binom{N}{n} e^{-n\beta t}\left(1 - e^{-\beta t}\right)^{N-n}
\]

(b) What is the expected population size at time t? [Hint: According to (a),
what type of probability distribution is P_Nn(t)?]
93. A radioactive substance emits particles over time according to a Poisson
process with parameter λ. Each emitted particle has an exponentially
distributed lifetime with parameter β, and the lifetime of any particular particle
is independent of that of any other particle. Let X_t be the number of particles
that exist at time t. Assuming that X_0 = 0, specify the parameters of this birth
and death process.
94. Consider a machine shop that has three machines of a particular type. The time
until any one of these machines fails is exponentially distributed with mean 
lifetime 10 h, and machines fail independently of one another. The shop has a 
single individual capable of repairing these machines. Once a machine fails, it 
will immediately begin service provided that the other two machines are still 
working; otherwise it will wait in a repair queue until the repair person has 
finished work on any other machines that need service. Time to repair is 
exponentially distributed with expected repair time 2 h. Obtain the stationary 
probabilities and determine the expected number of machines operating under 
stationary conditions. 
95. A system consists of two independent components connected in parallel, so the
system will function as long as at least one of the components functions.
Component A has an exponentially distributed lifetime with parameter α_0.
Once it fails, it immediately goes into repair, and its repair time is exponentially
distributed with parameter α_1. Similarly, component B has an exponentially
distributed lifetime with parameter β_0 and an exponentially distributed repair
time with parameter β_1. Determine the stationary probabilities for the
corresponding continuous-time Markov chain, and then the probability that
the system is operating.
96. The article “Optimal Preventive Maintenance Rate for Best Availability with
Hypo-Exponential Failure Distribution” (IEEE Trans. on Reliability, 2013:
351-361) describes the following model for maintenance of a particular
machine. The machine naturally has three states: 0 = “up” (i.e., fully
operational), 1 = first stage degraded, and 2 = second stage degraded. A machine in
state 2 requires corrective maintenance, which restores the machine to the “up”
state. But the machine’s operators can voluntarily put a machine currently in
state 0 or 1 into one other state, 3 = preventive maintenance. The cited article
gives the following instantaneous transition rates:

q_01 = λ_1   q_02 = 0     q_03 = δ
q_10 = 0     q_12 = λ_2   q_13 = δ
q_20 = μ     q_21 = 0     q_23 = 0
q_30 = m     q_31 = 0     q_32 = 0

The parameter δ > 0 is called the trigger rate for preventive maintenance and is
controlled by the machine’s operator.

(a) Draw a state diagram for this Markov process.

(b) Interpret the parameters λ_1, λ_2, μ, and m.

(c) Determine the stationary distribution of this chain.

The machine can be operated in both states 0 and 1, and so the availability
of the machine, A(δ), is defined to be the sum of the stationary probabilities
for those two states.

(d) Show that

\[
A(\delta) = \left[ 1 + \frac{\delta}{m} + \frac{\lambda_1 \lambda_2}{\mu(\lambda_1 + \lambda_2 + \delta)} \right]^{-1}
\]

(e) Determine the value of δ that maximizes the long-run proportion of time the
machine is available for use. [Hint: You’ll have to consider two separate
cases, depending on whether a certain quadratic equation has any positive
solutions.]

97. A discrete-time Markov chain, i.e., the type investigated in Chap. 6, can be
obtained from a continuous-time chain by sampling it every h time units. That
is, for n = 0, 1, 2, ... we define

Y_n = X_{nh},

where X_t is a Markov process. For example, the nurse’s movements in Example
7.33 could be observed every 6 min (= 1/10 h), and a discrete-time Markov
chain could be defined by Y_n = X_{n/10} = the nurse’s location at the nth observed
time.

(a) Let P be the one-step transition matrix for Y_n, so the (i, j)th entry of P is
p_ij = P(Y_{n+1} = j | Y_n = i). Show that p_ij ≈ q_ij h for i ≠ j and p_ii ≈ 1 − q_i h,
where the q_ij’s and q_i’s are the infinitesimal parameters of X_t and the
approximations are on the order o(h).

(b) Suppose Y_n is a regular chain, and that the one-step transition probabilities
in part (a) are exact (rather than just o(h)-approximate). Show that the
stationary distribution of Y_n is identical to that of its continuous-time
version X_t. [Hint: Use part (a) to show that the equations πP = π from
Chap. 6 and πQ = 0 from this section are one and the same.]
Supplementary Exercises (98-114)


98. Let X(t) be a WSS random process.
(a) Show that

\[
\mathrm{Var}[X(t+\tau) - X(t)] = E\left[(X(t+\tau) - X(t))^2\right] = 2[R_{XX}(0) - R_{XX}(\tau)]
\]

(b) Show that if C_XX(d) = C_XX(0) for any d ≠ 0, then X(t) is mean square
periodic, i.e., E[(X(t + d) − X(t))²] = 0.

(c) Show that if C_XX(d) = C_XX(0) for any d ≠ 0, then C_XX(τ) is periodic.
(A similar property holds for R_XX.) [Hint: Consider the covariance of
X(0) and X(τ + d) − X(τ), and use the fact that |Cov(U, V)| ≤ SD(U)·SD(V)
for any two rvs U and V.]

99. Consider the following model for binary voltage noise: let V_1, V_2, ... be
independent rvs with P(V_n = +1) = P(V_n = −1) = .5. Then define X(t) = V_n for
n − 1 ≤ t < n, i.e., V_1 is transmitted for 0 ≤ t < 1, V_2 is transmitted for
1 ≤ t < 2, and so on.

(a) Find the mean function of X(t).

(b) Find the autocovariance function of X(t). [Hint: Consider separate
cases depending on whether or not t and s lie in the same unit interval,
e.g., [1, 2).]

100. Modify the previous exercise as follows: let T_0 ~ Unif[0, 1] be independent of
the V_n’s. Then the random process X(t) equals V_1 for T_0 ≤ t < T_0 + 1, V_2 is
transmitted for T_0 + 1 ≤ t < T_0 + 2, and so on.

(a) Find the mean function of X(t).

(b) Find the autocorrelation function of X(t). [Hint: First find the conditional
distribution of X(t) and X(s) given T_0 = t_0.]

101. Define a collection of random processes by X_k(t) = A_k cos(ω_k t) + B_k sin(ω_k t) for
k = 1, 2, ..., n, where the coefficients A_1, ..., A_n, B_1, ..., B_n are iid Unif[−1, 1]
rvs, and the frequencies ω_1, ..., ω_n are constants. Let Y(t) = X_1(t) + ⋯ + X_n(t).

(a) Find the mean and autocovariance functions of X_k(t) for k = 1, 2, ..., n.

(b) Find the mean and autocovariance functions of Y(t). Is Y(t) WSS?

102. Let Θ_1, ..., Θ_n be iid Unif(−π, π] rvs and define a random process X(t) by

\[
X(t) = \sum_{k=1}^{n} a_k \sin(\omega_k t + \Theta_k)
\]

Is X(t) wide-sense stationary?

103. Let X(t) = cos(Ωt + Θ), where Ω and Θ are independent rvs, Θ ~ Unif(−π, π],
and Ω equals ω_k with probability p_k for k = 1, 2, ..., n (i.e., Ω is a discrete rv).
(a) Find the mean function of X(t).

(b) Find the autocovariance function of X(t).

(c) Is X(t) WSS?
104. Let X(t) be a WSS random process, and let Y(t) = X(t − d), a d-second delayed
version of X(t).
(a) Find the mean and autocorrelation functions of Y(t) in terms of those
of X(t).
(b) Is Y(t) WSS?
(c) Find the cross-correlation R_XY(t, t + τ). Are X(t) and Y(t) jointly WSS?
105. A rotor within a certain manufacturing machine must be replaced every 125 h
of use, on average. Let X_n denote the lifetime of the nth rotor (n = 1, 2, 3, ...),
and suppose the X_n’s are iid exponential rvs with mean 125 h. (In this context,
the “time” index n actually counts rotors, not hours or some other time unit.)
(a) Define S_n = X_1 + ⋯ + X_n. Interpret S_n in this context.
(b) Find the mean, variance, autocorrelation, and autocovariance functions of
S_n.
(c) Use the Central Limit Theorem to determine the approximate distribution
of S_50 and to approximate P(S_50 ≥ 6240), the chance that 50 rotors will be
sufficient to operate the machine for 3 years (40 h per week, 52 weeks
a year).
106. Let X(t) be a WSS random process with mean μ_X and autocovariance function
C_XX(τ).
(a) Show that E[⟨X(t)⟩_T] = μ_X for all T. [Note: Since this is the ensemble
average of X(t), it follows that X(t) is mean ergodic iff Var(⟨X(t)⟩_T) → 0 as
T → ∞.]
(b) It is straightforward to show that

\[
\mathrm{Var}(\langle X(t)\rangle_T) = \frac{1}{4T^2}\int_{-T}^{T}\int_{-T}^{T} C_{XX}(s - t)\,dt\,ds
\]

Make the substitution τ = s − t to prove

\[
\mathrm{Var}(\langle X(t)\rangle_T) = \frac{1}{2T}\int_{-2T}^{2T} C_{XX}(\tau)\left(1 - \frac{|\tau|}{2T}\right)d\tau
\]

so that X(t) is mean ergodic iff this integral converges to 0 as T → ∞.
[This can be a useful test for ergodicity when a model is specified in terms
of its covariance function and no explicit form of X(t) is available.]
(c) Show that X(t) is mean ergodic if

\[
\frac{1}{T}\int_{0}^{2T} C_{XX}(\tau)\,d\tau \to 0 \quad \text{as } T \to \infty.
\]


107. Let X_n be a WSS random sequence, and define Y_n = X_n − X_{n−1}. Is Y_n also
WSS?
108. Let X_n be iid, with mean 0 and variance σ². Define a kth-order moving average
sequence Y_n by

Y_n = a_1 X_n + ⋯ + a_k X_{n−k+1}

where the nonnegative constants a_i are such that a_1 + ⋯ + a_k = 1.

(a) Find the mean function of Y_n.

(b) Find the variance function of Y_n.

(c) Find the autocovariance function of Y_n.

(d) Is Y_n wide-sense stationary?

(e) Find the correlation coefficient ρ(Y_n, Y_{n+1}).

109. Suppose that noise impulses occur on a telephone line at random, with a mean
rate λ per second. Assume the occurrence of noise impulses meets the
conditions of a Poisson process.

(a) Find the probability that no noise impulses occur during the transmission
of a t-second message.

(b) Suppose that the message is encoded so that errors caused by a single
noise impulse can be corrected. What is the probability that a t-second
message is either error-free or correctable?

(c) Suppose the error correction protocols can reset themselves so long as
successive noise impulses are more than ε seconds apart. What is the
probability the next noise impulse will be corrected?

110. A bus has just departed from a certain New York City bus stop. Passengers for
the next bus arrive according to a Poisson process with rate 3 per minute.
Suppose the arrival time Y of the next bus has a uniform distribution on the
interval [0, 5].

(a) Given that Y= y, what is the expected number of passengers at the stop for 
this next bus? 

(b) Use the result of (a) along with the Law of Total Expectation to determine 
the expected number of passengers at this stop when the next bus arrives. 

(c) Given that Y=y, determine the (conditional) variance of the number of 
passengers at the stop for this next bus. Then use the Law of Total 
Variance to determine the standard deviation of the number of passengers 
at this stop for the next bus. 

111. Starting at time t = 0, commuters arrive at a subway station according to a
Poisson process with rate 2 per minute. The subway fare is $2. Suppose this
fare is “exponentially discounted” back to time 0; that is, if a commuter
arrives at time t, the resulting discounted fare is 2e^{−αt}, where α is the “discount
rate.”

(a) If five commuters arrive in the first 9 minutes, what is the expected value 
of the total discounted fare collected from these five individuals? [Hint: 
Recall that for a Poisson process, conditional on any particular number of 
events occurring in some time interval, each event occurrence time is 
uniformly distributed over that interval.] 
(b) What is the expected value of the total discounted fare collected from
customers who arrive in the first t_0 minutes? [Hint: Conditioning on the
number of commuters who arrive, use an expected value argument like
that employed in (a), and then apply the Law of Total Probability.]

112. Individuals enter a museum exhibit according to a Poisson process with rate λ.
The amount of time any particular individual spends in this exhibit is a
random variable having an exponential distribution with parameter θ, and
these exhibit-viewing times are independent of one another. Let Y(t) denote
the number of individuals who have entered the exhibit prior to time t and are
still viewing the exhibit, and let Z(t) denote the number of individuals who
have entered the exhibit and departed by time t.

(a) Obtain an expression for P(Y(t) = 6 and Z(t) = 4).

(b) Generalize the argument leading to the expression of (a) to obtain the joint
pmf of the two random variables Y(t) and Z(t).

113. According to the article “Reliability Evaluation of Hard Disk Drive Failures
Based on Counting Processes” (Reliability Engr. and System Safety, 2013:
110-118), particles accumulating on a disk drive come from two sources, one
external and the other internal. The article proposed a model in which the
internal source contains a number of loose particles M having a Poisson
distribution with mean value μ; when a loose particle releases, it immediately
enters the drive, and the release times are iid with cumulative distribution
function G(t). Let X(t) denote the number of loose particles not yet released at
a particular time t. Show that X(t) has a Poisson distribution with parameter
μ[1 − G(t)]. [Hint: Let Y(t) denote the number of particles accumulated on the
drive from the internal source by time t, so that X(t) + Y(t) = M. Obtain an
expression for P(X(t) = x, Y(t) = y), and then sum over y.]

114. Suppose the strength of a system is a nonnegative rv Y with pdf g(y). The
system experiences shocks over time according to a Poisson process with rate
λ. Let X_i denote the magnitude of the ith shock, and suppose the X_i’s are iid
with cdf F(x). If, when the ith shock occurs, X_i > Y, then the system
immediately fails; otherwise it continues to operate as though nothing happened. Let
S(t) denote the number of shocks in [0, t], and let T denote the system lifetime.

(a) Determine the probability P(T > t | Y = y and S(t) = n). [Hint: Your answer
should involve y, n, and the cdf F.]

(b) Apply the Law of Total Probability along with (a) to determine
P(T > t | Y = y).

(c) Obtain an integral expression for the probability that the system lifetime
exceeds t. [Hint: Write P(T > t) as a double integral involving the joint pdf
of T and Y. Then simplify to a single integral using (b).]

(Based on the article “On Some Comparisons of Lifetimes for Reliability
Analysis,” Reliability Engr. and Safety Analysis, 2013: 300-304.)


Introduction to Signal Processing 


The previous chapter introduced the concept of a random process and explored in 
depth the temporal (i.e., time-related) properties of such processes. Many of the 
specific random processes introduced in Chap. 7 are used in modern engineering to 
model noise or other unpredictable phenomena in signal communications. In this 
chapter, we investigate the frequency-related properties of random processes, with a 
particular emphasis on power and filtering. 

Section 8.1 introduces the power spectral density, which describes how the 
power in a random signal is distributed across all possible frequencies. This first 
section also discusses so-called white noise processes, which are best described in 
terms of a frequency distribution. In Sect. 8.2, we look at filters; or, more precisely, 
linear, time-invariant (LTI) systems. We explore some techniques for filtering 
random signals, including the use of so-called “ideal” filters. Finally, Sect. 8.3 
reexamines these topics in the context of discrete-time signals. 

We assume throughout this chapter that readers have some familiarity with 
(nonrandom) signals and frequency representations. In particular, knowledge of 
Fourier transforms and LTI systems will be critical to understanding our exposition. 
Appendix B includes a brief summary of the properties of Fourier transforms; 
Sect. 8.2 includes a short discussion of LTI systems. 


8.1 Power Spectral Density 


In Chap. 7, we considered numerous models for random processes X(t) as well as
several ways to quantify the statistical properties of such processes (the mean,
variance, autocovariance, and autocorrelation functions). All of these statistical
functions describe the behavior of X(t) in the time domain. Now we turn our
attention to properties of a random process that can be described in the frequency
domain.

At the outset, some basic notation and conventions must be established. First, the
letter j will denote √(−1) in order to be consistent with engineering practice (some
readers may be more familiar with the symbol i). Second, we will denote frequency
by f, whose units are Hertz (1/s). For those more familiar with radian frequency ω,
the two are of course related by ω = 2πf. Third, throughout this chapter X(t) will
represent a random current waveform through a 1-Ω impedance. This is a standard
convention in signal processing; it has the advantage that we can talk about current
and voltage interchangeably (since V = IR). Finally, we will assume that all random
processes are wide-sense stationary (WSS) unless otherwise noted, because this is a
key assumption for the main theorem of this section.

Our ultimate goal is to describe how the power in a random process is distributed
across the frequency spectrum. From basic electrical engineering, we know
P = I²R, where P = power, I = current = X(t), and R = resistance = 1 Ω. Hence,
we may think of I²R = X²(t) as the “instantaneous power” in the random process
at time t.


DEFINITION
Let X(t) be a WSS random process. The (ensemble) average power (also
called the expected power) of X(t), denoted by P_X, is

\[
P_X = E[X^2(t)]
\]

The average power of X(t) is related to its autocorrelation function by

\[
P_X = R_{XX}(0)
\]


Notice we may write P_X rather than P_X(t), i.e., the ensemble average power in
X(t) does not vary with time. This is due to the assumption of wide-sense
stationarity. To see why P_X equals R_XX(0), recall that for WSS processes we have
R_XX(τ) = E[X(t)X(t+τ)], which does not depend on t. Setting τ = 0 immediately
gives R_XX(0) = E[X²(t)] = P_X.


Example 8.1 In Chap. 7, we introduced the “phase variation” random process
X(t) = A_0 cos(ω_0 t + Θ), where the phase shift Θ is uniformly distributed on the
interval (−π, π]. We showed that X(t) is WSS, with mean μ_X = 0 and autocovariance
function C_XX(τ) = (A_0²/2)cos(ω_0 τ), from which R_XX(τ) = (A_0²/2)cos(ω_0 τ) as well.
Thus, the ensemble average power in the phase variation process is

\[
P_X = R_{XX}(0) = \frac{A_0^2}{2}\cos(\omega_0 \cdot 0) = \frac{A_0^2}{2}
\]

This formula for the average power of a sinusoid is well known to electrical
engineers. ■
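
The result P_X = A_0²/2 can be checked by simulation. The following sketch estimates E[X²(t)] by averaging over many independently drawn phases; the amplitude, frequency, sample size, and seed are arbitrary illustrative choices, not values from the text.

```python
import math
import random

def ensemble_average_power(a0, omega0, t, n_paths=200_000, seed=1):
    """Monte Carlo estimate of E[X(t)^2] for X(t) = a0*cos(omega0*t + Theta),
    with Theta drawn uniformly from (-pi, pi)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        theta = rng.uniform(-math.pi, math.pi)
        total += (a0 * math.cos(omega0 * t + theta)) ** 2
    return total / n_paths

a0, omega0 = 3.0, 5.0        # hypothetical amplitude and radian frequency
exact = a0 ** 2 / 2          # the formula P_X = A_0^2 / 2
for t in (0.0, 1.7):
    print(t, ensemble_average_power(a0, omega0, t), exact)
```

The estimates at the two time points should agree with each other and with A_0²/2 up to sampling error, reflecting the fact that the ensemble average power of a WSS process does not depend on t.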

Now we turn to describing how the expected power P_X in a random process is
distributed across the frequency domain. For example, is this power concentrated
at just a few frequencies, or across a very large frequency band? Typically in
engineering, we move from the time domain t to the frequency domain f by taking
the Fourier transform of our time-dependent function. Because of some technical
issues related to the existence of certain integrals that arise in connection with
random processes, we must proceed carefully here. To begin, define a truncated
version of a random process X(t) by

\[
X_T(t) = \begin{cases} X(t) & |t| \le T \\ 0 & \text{otherwise} \end{cases}
\]


This function is square-integrable with respect to t, and so its Fourier transform
exists. Define

\[
F_T(f) = \mathcal{F}\{X_T(t)\} = \int_{-\infty}^{\infty} X_T(t)\,e^{-j2\pi f t}\,dt = \int_{-T}^{T} X(t)\,e^{-j2\pi f t}\,dt
\]

Parseval’s Theorem then connects the integrals of X_T(t) and F_T(f):
∫_{−∞}^{∞} |F_T(f)|² df = ∫_{−∞}^{∞} |X_T(t)|² dt = ∫_{−T}^{T} X²(t) dt, where the absolute value bars
denote the magnitude of a possibly complex number. (Since X²(t) is real-valued
and nonnegative, those bars may be dropped.) Divide both sides by 2T:

\[
\int_{-\infty}^{\infty} \frac{|F_T(f)|^2}{2T}\,df = \frac{1}{2T}\int_{-T}^{T} X^2(t)\,dt \tag{8.1}
\]


Since the right-most expression in Eq. (8.1) gives the average power in X(t)
across the interval [−T, T], so does the far left term, and it follows that the integrand
|F_T(f)|²/2T describes how that average power is distributed in the frequency
domain. In fact, |F_T(f)|² has units of energy, and so the units on |F_T(f)|²/2T are
energy/time = power. We still need to remove the truncation of the original X(t),
and it is desirable to take the ensemble average of this power representation.
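
A discrete analogue of Eq. (8.1) can be verified numerically. In the sketch below, one sample path on [−T, T] is represented by N equally spaced samples (a sinusoid with random phase plus Gaussian noise, chosen purely for illustration), F_T(f) is approximated by a Riemann sum, and the frequency integral is replaced by a sum over the N frequencies m/(2T). With this particular grid the two sides agree up to rounding error, which is the discrete form of Parseval’s Theorem; it is not a proof of the continuous statement.

```python
import cmath
import math
import random

def truncated_transform(samples, dt, f):
    """Riemann-sum approximation of F_T(f), the integral of
    X(t)*exp(-j*2*pi*f*t) over [-T, T], using midpoint sample times."""
    n = len(samples)
    t_start = -n * dt / 2          # left edge of the interval [-T, T]
    return dt * sum(
        x * cmath.exp(-2j * math.pi * f * (t_start + (k + 0.5) * dt))
        for k, x in enumerate(samples)
    )

rng = random.Random(0)
n, dt = 128, 0.01                  # hypothetical grid: 2T = 1.28 s
two_t = n * dt
theta = rng.uniform(-math.pi, math.pi)
samples = [
    math.cos(2 * math.pi * 3 * (-two_t / 2 + (k + 0.5) * dt) + theta)
    + rng.gauss(0, 0.5)
    for k in range(n)
]

# Right side of Eq. (8.1): time-average of X^2(t) over [-T, T].
time_side = sum(x * x for x in samples) * dt / two_t

# Left side of Eq. (8.1): integrate |F_T(f)|^2 / 2T over the frequency grid.
df = 1 / two_t
freq_side = sum(
    abs(truncated_transform(samples, dt, m * df)) ** 2
    for m in range(-n // 2, n // 2)
) * df / two_t

print(time_side, freq_side)
```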


DEFINITION
The power spectral density (psd), or power spectrum, of a random process
X(t) is defined by

\[
S_{XX}(f) = \lim_{T \to \infty} \frac{E\left[\,|F_T(f)|^2\,\right]}{2T}
\]

As may be evident from the preceding development, applying this definition is
typically extremely difficult in practice. Thankfully, for wide-sense stationary
processes, there is a simpler method for calculating the power spectral density
S_XX(f). The formula, presented in the following theorem, is hinted at by the fact
that the average power itself can be found through the autocorrelation function,
P_X = R_XX(0), as noted before. It was first discovered by Albert Einstein but is more
commonly attributed to Norbert Wiener and Aleksandr Khinchin.


WIENER-KHINCHIN THEOREM
If X(t) is a wide-sense stationary random process, then

\[
S_{XX}(f) = \mathcal{F}\{R_{XX}(\tau)\}
\]


A proof of this theorem appears at the end of this section. 


Example 8.2 Let X(t) = 220cos(2000πt + Θ), an example of the phase variation
process from Example 8.1 (with A_0 = 220 and ω_0 = 2000π). The expected power of
this signal is A_0²/2 = 24,200 W = 24.2 kW. It’s clear from the formula for X(t) that it
broadcasts this 24.2 kW signal at a single frequency of 2000π radians per second, or
1 kHz. Thus, we anticipate that all the power in X(t) is concentrated at 1 kHz. To
verify this, use the autocorrelation function from Example 8.1 and apply the
Wiener-Khinchin Theorem:

\[
S_{XX}(f) = \mathcal{F}\{R_{XX}(\tau)\} = \mathcal{F}\left\{\frac{A_0^2}{2}\cos(\omega_0 \tau)\right\} = \mathcal{F}\{24{,}200\cos(2000\pi\tau)\}
\]

Use the linear property of Fourier transforms, and then apply the known
transform of the cosine function:

\[
\begin{aligned}
S_{XX}(f) &= 24{,}200\,\mathcal{F}\{\cos(2000\pi\tau)\} \\
&= 24{,}200 \cdot \tfrac{1}{2}\left[\delta(f - 1000) + \delta(f + 1000)\right] \\
&= 12{,}100\left[\delta(f - 1000) + \delta(f + 1000)\right]
\end{aligned}
\]

where δ(·) denotes an impulse function (see Appendix B for more information).
Figure 8.1 shows this power spectral density, which consists of two impulses
located at ±1000 Hz, each with intensity 12,100. Of course, in practice, the
frequency −1000 Hz is really the same as +1000 Hz, and so the power spectrum
of this random process in the positive frequency domain is located solely
at 1000 Hz; the impulse at this lone frequency carries intensity 12,100
+ 12,100 = 24,200, the ensemble average power of the signal.

[Fig. 8.1 The power spectral density of Example 8.2: impulses of intensity 12,100 at f = −1000 and f = 1000] ■
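
The two-sided split of the power in Example 8.2 can be reproduced with a discrete Fourier transform of one sampled realization. This is only a rough analogue of the psd: the sampling rate and record length below are hypothetical choices (picked so that 1000 Hz falls exactly on a frequency bin), and the quantity |DFT_k|²/N² is the average power attributed to bin k rather than a true density.

```python
import cmath
import math
import random

def two_sided_power_per_bin(samples):
    """Return |DFT_k|^2 / N^2 for k = 0, ..., N-1: the average power that the
    discrete Fourier transform attributes to each frequency bin."""
    n = len(samples)
    powers = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * m / n)
                    for m, x in enumerate(samples))
        powers.append(abs(coeff) ** 2 / n ** 2)
    return powers

fs, n = 8000, 400                  # hypothetical sampling rate (Hz) and length
theta = random.Random(0).uniform(-math.pi, math.pi)
samples = [220 * math.cos(2000 * math.pi * m / fs + theta) for m in range(n)]

powers = two_sided_power_per_bin(samples)
bin_hz = fs / n                    # spacing between adjacent bins, in Hz
k_pos = round(1000 / bin_hz)       # bin corresponding to +1000 Hz
k_neg = n - k_pos                  # bin that represents -1000 Hz
print(powers[k_pos], powers[k_neg], sum(powers))
```

Under these assumptions the bins at ±1000 Hz each carry about 12,100 W, every other bin is essentially empty, and the total is about 24,200 W, mirroring the two impulses in Fig. 8.1.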


Because the Fourier transform results in part of the power spectral density being 
represented at negative frequencies, the psd is sometimes called a two-sided power 
spectrum. Next we will explore this property and others more thoroughly. 


8.1.1 Properties of the Power Spectral Density 


The following proposition describes several basic properties of S_XX(f) and
indicates how the psd is related to average power.


PROPOSITION

Let S_XX(f) be the power spectral density of a WSS random process X(t).

1. S_XX(f) is real-valued and nonnegative.

2. S_XX(f) is an even function, i.e., S_XX(−f) = S_XX(f). (This is the
“two-sided” nature of the psd.)

3. ∫_{−∞}^{∞} S_XX(f) df = P_X, the ensemble average power in X(t).


Proof Property 1 follows from the definition of the psd: even though the Fourier
transform F_T(f) may be complex-valued, |F_T(f)|²/2T must be real and
nonnegative. Since S_XX(f) is the limit of the expected value of |F_T(f)|²/2T, it must also be
real and nonnegative.

To prove property 2, we invoke the Wiener-Khinchin Theorem. Since R_XX(τ) is
even, we can simplify its Fourier transform:

\[
S_{XX}(f) = \mathcal{F}\{R_{XX}(\tau)\} = \int_{-\infty}^{\infty} R_{XX}(\tau)\,e^{-j2\pi f\tau}\,d\tau = \int_{-\infty}^{\infty} R_{XX}(\tau)\cos(2\pi f\tau)\,d\tau \tag{8.2}
\]

The sine component of the complex exponential drops out because R_XX(τ) is
even. From Eq. (8.2) it is clear that S_XX(f) is both real-valued and an even function
of f, since cosine is even.
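
Equation (8.2) can be illustrated numerically with an even autocorrelation-type function. The sketch below takes the unit triangle tri(τ) = 1 − |τ| on [−1, 1], approximates its Fourier transform by a midpoint rule, and compares the result with sinc²(f). It assumes the normalized convention sinc(x) = sin(πx)/(πx); the grid size is an arbitrary choice.

```python
import cmath
import math

def tri(tau):
    """Unit triangle function: 1 - |tau| for |tau| <= 1, and 0 otherwise."""
    return max(0.0, 1.0 - abs(tau))

def sinc(x):
    """Normalized sinc, sin(pi x)/(pi x), with sinc(0) = 1 (assumed convention)."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def fourier_transform(func, f, lo=-1.0, hi=1.0, n=20_000):
    """Midpoint-rule approximation of the integral of
    func(tau)*exp(-j*2*pi*f*tau) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0j
    for k in range(n):
        tau = lo + (k + 0.5) * h
        total += func(tau) * cmath.exp(-2j * math.pi * f * tau)
    return h * total

for f in (0.0, 0.5, 1.5):
    value = fourier_transform(tri, f)
    print(f, value.real, value.imag, sinc(f) ** 2)
```

The imaginary part should vanish up to rounding error, because the sine component integrates to zero against an even function, and the real part should match sinc²(f) closely.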

Property 3 also follows from the Wiener-Khinchin Theorem: writing the
autocorrelation function as the inverse Fourier transform of the psd, we have

\[
R_{XX}(\tau) = \mathcal{F}^{-1}\{S_{XX}(f)\} = \int_{-\infty}^{\infty} S_{XX}(f)\,e^{+j2\pi f\tau}\,df
\]

and therefore

\[
P_X = R_{XX}(0) = \int_{-\infty}^{\infty} S_{XX}(f)\,e^{j2\pi f(0)}\,df = \int_{-\infty}^{\infty} S_{XX}(f)\,df
\]


The foregoing proposition gives some insight into the interpretation of the power
spectral density. As stated previously, S_XX(f) describes how the ensemble average
power in X(t) is distributed across the frequency domain. Since power must be real
and nonnegative, so must the power spectrum. Property 3 shows why S_XX(f) is
called a “density”: if we integrate this function across its entire domain, we recover
the total (expected) power in the signal, P_X, much in the same way that integrating a
pdf from −∞ to ∞ returns the total probability of 1. Property 3 also indicates the
appropriate units for the psd: since the integral is performed with respect to f (Hertz)
and the result is power (watts), the correct units for the power spectral density are
watts per Hertz (W/Hz).

Property 2 makes precise the two-sided nature of a psd. The power spectrum
S_XX(f) will always be symmetric about f = 0; this is a by-product of how Fourier
transforms are computed and the fact that autocorrelation functions are always
symmetric in τ. But we must make sense of it in terms of “true” (i.e., nonnegative)
frequencies. Look back at Example 8.2: that particular phase variation random
process had a power spectral density consisting of two impulses, each of intensity
12,100 W/Hz, at ±1 kHz. We can interpret the impulse at −1000 by mentally
“folding” the power spectrum along the vertical axis, left to right, so that the two
impulses line up at +1 kHz with a total intensity of 24,200 W/Hz. Integrating that
impulse with respect to f recovers the ensemble average power of 24.2 kW.

Our next example illustrates a more general psd, including components other 
than impulses. 


Example 8.3: Partitioning a power spectrum. Suppose X_1(t) and X_2(t) are
independent, zero-mean, WSS random processes with autocorrelation functions

R_11(τ) = 2000tri(10,000τ),   R_22(τ) = 650cos(40,000πτ)

Define a new random process by X(t) = X_1(t) + X_2(t) + 40. We encountered this
random process in Example 7.18, from which we know that X(t) is WSS with a
mean of μ_X = 40 and an autocorrelation function of

\[
\begin{aligned}
R_{XX}(\tau) &= R_{11}(\tau) + R_{22}(\tau) + 40^2 \\
&= 2000\,\mathrm{tri}(10{,}000\tau) + 650\cos(40{,}000\pi\tau) + 1600
\end{aligned}
\]


First, let’s find the ensemble average power in X(t):

P_X = R_XX(0) = 2000 + 650 + 1600 = 4250 W

Recall from Example 7.18 that X(t) consists of three pieces: the aperiodic
component X_1(t), the periodic component X_2(t), and the dc offset of 40. These
deliver a total of 4.25 kW of power: 2000 W from X_1(t), 650 W from X_2(t), and
1600 W from the dc power offset (recall that we always assume R = 1 Ω, so
P = I²R = 40²(1) = 1600 for that term).

Next, let’s see how this 4250 W of power is distributed in the frequency domain
by determining the power spectral density of X(t). Apply the Wiener-Khinchin
Theorem:

\[
\begin{aligned}
S_{XX}(f) &= \mathcal{F}\{R_{XX}(\tau)\} = \mathcal{F}\{2000\,\mathrm{tri}(10{,}000\tau) + 650\cos(40{,}000\pi\tau) + 1600\} \\
&= 2000\,\mathcal{F}\{\mathrm{tri}(10{,}000\tau)\} + 650\,\mathcal{F}\{\cos(40{,}000\pi\tau)\} + \mathcal{F}\{1600\}
\end{aligned}
\]

To evaluate each of these three Fourier transforms, we use the table of Fourier
pairs in Appendix B. The last two are straightforward, while the transform of
tri(10,000τ) requires the rescaling property with a = 10,000. Since the Fourier
transform pair of tri(t) is sinc²(f), the ultimate result is

\[
\begin{aligned}
S_{XX}(f) &= 2000 \cdot \frac{1}{|10{,}000|}\,\mathrm{sinc}^2\!\left(\frac{f}{10{,}000}\right)
+ 650 \cdot \tfrac{1}{2}\left[\delta(f - 20{,}000) + \delta(f + 20{,}000)\right] + 1600\,\delta(f) \\
&= 0.2\,\mathrm{sinc}^2\!\left(\frac{f}{10{,}000}\right) + 325\left[\delta(f - 20{,}000) + \delta(f + 20{,}000)\right] + 1600\,\delta(f)
\end{aligned}
\]

A graph of this power spectrum appears in Fig. 8.2. Notice the graph is
symmetric about the vertical axis f = 0, as guaranteed by property 2 of the previous
proposition. The psd consists of three elements, corresponding to the three
components of the original signal. The power spectrum of the aperiodic component
appears as a continuous function (a true “density”) that vanishes as |f| → ∞. This is
sometimes referred to as the dissipative component of the psd. The periodic
component of the signal has psd equal to a pair of impulses (sometimes called
split impulses) at its fundamental frequency: here, 40,000π radians per second, or
20 kHz. Finally, the direct current corresponds to a “frequency” of f = 0; thus, the
dc power offset of 1600 W is represented by 1600δ(f), an impulse at f = 0.
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Syy(f) 
A 


0.20 


(325) 
.05 
r T T T T T T T T >f 
40000 -30000 -20000 -10000 0 10000 20000 30000 40000 
Fig. 8.2 Power spectral density of Example 8.3 a 


As illustrated in the foregoing example, a power spectral density generally consists 
of at most three pieces, {dissipative components} + {periodic components} + 
{dc power offset}, and the last two will be comprised of impulses. 


8.1.2. Power in a Frequency Band 


Suppose we wish to determine how much of the power in a random signal lies
within a particular frequency band; this is, as it turns out, a primary purpose of the
psd. For frequencies $f_1$ and $f_2$ with $0 < f_1 < f_2$, let $P_X[f_1, f_2]$ denote the expected
power in X(t) in the band $[f_1, f_2]$. Then, to account for the two sides of the power
spectrum, we calculate as follows:

$$P_X[f_1, f_2] = \int_{-f_2}^{-f_1} S_{XX}(f)\,df + \int_{f_1}^{f_2} S_{XX}(f)\,df = 2\int_{f_1}^{f_2} S_{XX}(f)\,df \qquad (8.3)$$

The last two expressions in Eq. (8.3) are equal because $S_{XX}(f)$ is an even
function. Figure 8.3a shows a generic power spectrum. Figure 8.3b shows the
calculation of power in a band, accounting for the two sides of the psd; it’s clear
that we could simply double the right-hand area and get the same result.

Fig. 8.3 (a) A generic power spectral density; (b) the ensemble average power in a specified
frequency band

Extra care must be taken to find the power in X(t) below some frequency $f_2$, i.e.,
between 0 and $f_2$, including the possible dc power offset at f = 0. When we “fold” the
negative frequencies over to the positive side, any power represented by an impulse
at f = 0 is not duplicated. Therefore, we cannot simply double the entire integral of
$S_{XX}(f)$ from 0 to $f_2$; we must count the dc power offset a single time, and then
integrate the rest of the psd. Written mathematically,

$$P_X[0, f_2] = \int_{-f_2}^{f_2} S_{XX}(f)\,df = (\text{dc power offset}) + 2\int_{0^+}^{f_2} S_{XX}(f)\,df \qquad (8.4)$$

The lower limit $0^+$ in Eq. (8.4) indicates that the integral term does not include an
impulse at zero, should one exist.
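
Equations (8.3) and (8.4) can be sketched numerically. The Python fragment below is only an illustration under stated assumptions: the function name `band_power`, the midpoint rule, and the toy psd (a continuous part $e^{-|f|}$ plus a dc impulse of weight 3) are hypothetical choices made for this sketch and do not come from the text. The continuous part is integrated over the positive band and doubled; the dc impulse is counted once, and only when the band starts at 0.

```python
import math

def band_power(continuous_psd, f1, f2, dc_power=0.0, n=10_000):
    """Expected power in the band [f1, f2] with 0 <= f1 < f2.

    continuous_psd: the non-impulsive part of an even psd (a function of f).
    dc_power: weight of the impulse at f = 0, counted once as in Eq. (8.4).
    Impulses at nonzero frequencies are NOT handled by this sketch.
    """
    h = (f2 - f1) / n
    # Midpoint rule for the integral of the continuous part over [f1, f2].
    area = sum(continuous_psd(f1 + (k + 0.5) * h) for k in range(n)) * h
    power = 2.0 * area              # fold in the negative frequencies, Eq. (8.3)
    if f1 == 0:
        power += dc_power           # dc impulse counted a single time, Eq. (8.4)
    return power

# Hypothetical psd: exp(-|f|) plus a dc impulse of weight 3.
def toy_psd(f):
    return math.exp(-abs(f))

p_low = band_power(toy_psd, 0.0, 1.0, dc_power=3.0)    # band includes f = 0
p_band = band_power(toy_psd, 1.0, 2.0, dc_power=3.0)   # band excludes f = 0
print(p_low, p_band)
```

For this toy psd the exact answers are $3 + 2(1 - e^{-1})$ and $2(e^{-1} - e^{-2})$, which the numerical values should match closely.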


Example 8.4 For the random process X(t) in Example 8.3, let’s first find the
ensemble average power in the band from 10 to 30 kHz. With $f_1 = 10{,}000$ and
$f_2 = 30{,}000$, we proceed as follows:

$$P_X[10{,}000,\ 30{,}000] = 2\int_{10{,}000}^{30{,}000} S_{XX}(f)\,df$$
$$= 2\int_{10{,}000}^{30{,}000} \left[0.2\,\mathrm{sinc}^2\!\left(\frac{f}{10{,}000}\right) + 325\left[\delta(f - 20{,}000) + \delta(f + 20{,}000)\right] + 1600\,\delta(f)\right] df$$
$$= 2\int_{10{,}000}^{30{,}000} 0.2\,\mathrm{sinc}^2\!\left(\frac{f}{10{,}000}\right) df + 2\int_{10{,}000}^{30{,}000} 325\,\delta(f - 20{,}000)\,df$$
$$\quad + 2\int_{10{,}000}^{30{,}000} 325\,\delta(f + 20{,}000)\,df + 2\int_{10{,}000}^{30{,}000} 1600\,\delta(f)\,df$$

To evaluate the integrals of the three impulses, we use the sifting property (see
Appendix B); since the specified frequencies of the last two impulses lie outside the
band [10,000, 30,000], those two integrals are zero. The calculation continues

$$P_X[10{,}000,\ 30{,}000] = 2\int_{10{,}000}^{30{,}000} 0.2\,\mathrm{sinc}^2\!\left(\frac{f}{10{,}000}\right) df + 2(325) + 2(0) + 0$$
$$= 0.4\int_{10{,}000}^{30{,}000} \frac{\sin^2(\pi f/10{,}000)}{(\pi f/10{,}000)^2}\,df + 650 = 127.17 + 650 = 777.17 \text{ W}$$


This last integration of the sinc² function requires software (or an advanced
calculator). Next, let’s find the average power in X(t) concentrated below 10 kHz.
We must remember to include the impulse representing the dc power offset at f = 0,
but only once. Also, we can ignore the impulses at ±20 kHz, since they lie outside
our desired range. Applying Eq. (8.4),

$$P_X[0,\ 10{,}000] = \int_{-10{,}000}^{10{,}000} S_{XX}(f)\,df = 1600 + 2\int_{0^+}^{10{,}000} S_{XX}(f)\,df$$
$$= 1600 + 2\int_{0}^{10{,}000} 0.2\,\mathrm{sinc}^2\!\left(\frac{f}{10{,}000}\right) df = 1600 + 1805.65 = 3405.65 \text{ W}$$

Again, a numerical integration tool is required. ■
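
The two numerical integrations in Example 8.4 can be reproduced with any quadrature routine. The sketch below uses a hand-written composite Simpson's rule in plain Python; the helper names `sinc`, `simpson`, and `continuous_psd` are choices made for this illustration, not something the text prescribes. The impulses are handled by hand, exactly as in the example: the pair at ±20 kHz contributes 2(325) to the first band, and the dc impulse contributes 1600, once, to the second.

```python
import math

def sinc(x):
    """Normalized sinc, sin(pi x)/(pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def simpson(g, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = g(a) + g(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * g(a + k * h)
    return total * h / 3

def continuous_psd(f):
    """Continuous (dissipative) part of the psd from Example 8.3."""
    return 0.2 * sinc(f / 10_000) ** 2

# Band 10 kHz to 30 kHz: doubled continuous part plus the impulse pair at +/-20 kHz.
p_band = 2 * simpson(continuous_psd, 10_000, 30_000) + 2 * 325

# Below 10 kHz: dc impulse counted once, plus doubled continuous part.
p_low = 1600 + 2 * simpson(continuous_psd, 0, 10_000)

print(round(p_band, 2), round(p_low, 2))
```

The printed values should agree with the 777.17 W and 3405.65 W obtained in the example, up to rounding in the last digit.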


8.1.3. White Noise Processes 


As mentioned previously, engineers frequently use random process models in an 
attempt to describe the noise acquired by an intended signal during transmission. 
One of the simplest models, called white noise, can most easily be described by its 
frequency representation (as opposed to the time-domain models of Chap. 7). 


DEFINITION
A random process N(t) is (pure) white noise if there exists a constant $N_0 > 0$,
called the intensity parameter, such that the psd of N(t) is

$$S_{NN}(f) = \frac{N_0}{2}, \qquad -\infty < f < \infty$$

As a special case, N(t) is called Gaussian white noise if N(t) is a Gaussian
process as defined in Sect. 7.6 and its psd is as above.

Fig. 8.4 Pure white noise: (a) power spectral density; (b) autocorrelation function


The power spectral density of pure white noise appears in Fig. 8.4a. A white
noise model assumes that all frequencies appear at equal power intensity throughout
the entire spectrum. In that sense, it is analogous to white light (all frequencies
at equal intensity), which gives white noise its name.

White noise processes can also be partially described in the time domain through
the autocorrelation function:

$$R_{NN}(\tau) = \mathcal{F}^{-1}\{S_{NN}(f)\} = \mathcal{F}^{-1}\!\left\{\frac{N_0}{2}\right\} = \frac{N_0}{2}\,\delta(\tau)$$

Figure 8.4b shows this autocorrelation function. From property 5 of the main
proposition in Sect. 7.3, it follows that the mean of a white noise process is $\mu_N = 0$.
(That’s also evident from the psd itself, since it lacks an impulse at f = 0 that would
correspond to a dc power offset.) Thus, the autocovariance function of pure white
noise is also $C_{NN}(\tau) = R_{NN}(\tau) = (N_0/2)\delta(\tau)$.

This has a rather curious consequence: since δ(τ) = 0 for τ ≠ 0, $C_{NN}(\tau) = (N_0/2)\delta(\tau)$
implies that the random variables N(t) and N(t + τ) are uncorrelated except when
τ = 0. If N(t) is Gaussian white noise, then N(t) and N(t + τ) are independent for
τ ≠ 0 (since uncorrelated implies independent for normal rvs), even if the two times
are very close together. That is, a pure white noise process has the property that its
location at any given time is completely uncorrelated with, say, its location the
nanosecond before!
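
A discrete-time analogue makes this lack of correlation easy to see. The following Python sketch is an illustration only: it replaces continuous-time white noise (which cannot be simulated directly) with a sequence of independent Gaussian samples, and the standard deviation, sample size, seed, and function name `sample_autocorr` are arbitrary assumptions. The sample autocorrelation should be close to the variance at lag 0 and close to 0 at every other lag.

```python
import random

random.seed(0)          # arbitrary seed, for reproducibility only
sigma = 2.0             # hypothetical standard deviation of each sample
n = 100_000             # hypothetical number of samples

# Independent N(0, sigma^2) samples: a discrete-time stand-in for white noise.
samples = [random.gauss(0.0, sigma) for _ in range(n)]

def sample_autocorr(x, lag):
    """Average of x[i] * x[i + lag] over all available index pairs."""
    m = len(x) - lag
    return sum(x[i] * x[i + lag] for i in range(m)) / m

r0 = sample_autocorr(samples, 0)   # should be near sigma**2 = 4
r1 = sample_autocorr(samples, 1)   # should be near 0
r5 = sample_autocorr(samples, 5)   # should be near 0
print(r0, r1, r5)
```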

Although the pure white noise model is commonly used in engineering practice,
no such process can exist in physical reality. In order for the description in the
preceding paragraph to be true, the process would have to “move” infinitely
quickly, thus requiring infinite power. This can be seen directly from the definition:

$$P_N = \int_{-\infty}^{\infty} S_{NN}(f)\,df = \int_{-\infty}^{\infty} \frac{N_0}{2}\,df = +\infty$$

That is, the area under the curve in Fig. 8.4a is infinite. So, why use a model for a
process that cannot exist? As we’ll see in Sect. 8.2, when a white noise process is
passed through certain filters, the resulting output will have finite power. Thus, if
we are interested in analyzing the filtered version of our communication noise,
using the very simple model of pure white noise for the input is not unreasonable.

Fig. 8.5 Examples of band-limited white noise: (a) lowpass; (b) bandpass

Though pure white noise cannot exist, various types of band-limited white
noise are physically realizable. Two types of band-limited white noise, called
lowpass white noise and bandpass white noise, are depicted in Fig. 8.5.¹ Notice
that the area under both of these power spectral densities is finite, and thus the
corresponding random processes both have finite power.


8.1.4 Power Spectral Density for Two Processes 


For two jointly WSS random processes X(t) and Y(t), the cross-power spectral
density of X(t) with Y(t) is defined by

$$S_{XY}(f) = \mathcal{F}\{R_{XY}(\tau)\},$$

where $R_{XY}(\tau)$ is the cross-correlation function defined in Sect. 7.2. A similar
definition can be made for $S_{YX}(f)$. Since $R_{XY}(\tau)$ is generally not an even function
of τ, the cross-power spectral density need not be real-valued. When X(t) and Y(t)
are orthogonal random processes, $R_{XY}(\tau) = 0$ by definition and so $S_{XY}(f) = 0$. The
cross-power spectral density gives information about the distribution of the power
generated by combining X(t) and Y(t), above and beyond their individual power
spectra, when X(t) and Y(t) are not orthogonal. See Exercise 16.


¹ Readers already familiar with filters will recognize the terms “lowpass” and “bandpass.” We will
see these terms again in the next section.

Proof of the Wiener–Khinchin Theorem The definition of $S_{XX}(f)$ involves the
squared magnitude of a complex function; from the theory of complex numbers, we
know that $|z|^2 = z \cdot z^*$, where * denotes the complex conjugate. The proof then
proceeds as follows:

$$S_{XX}(f) = \lim_{T\to\infty} E\!\left[\frac{|F_T(f)|^2}{2T}\right] = \lim_{T\to\infty} \frac{1}{2T}\,E\!\left[F_T(f)\,F_T^*(f)\right]$$
$$= \lim_{T\to\infty} \frac{1}{2T}\,E\!\left[\int_{-T}^{T} X(s)e^{-j2\pi fs}\,ds \left(\int_{-T}^{T} X(t)e^{-j2\pi ft}\,dt\right)^{\!*}\,\right]$$
$$= \lim_{T\to\infty} \frac{1}{2T}\,E\!\left[\int_{-T}^{T} X(s)e^{-j2\pi fs}\,ds \int_{-T}^{T} X(t)e^{+j2\pi ft}\,dt\right]$$
$$= \lim_{T\to\infty} \frac{1}{2T}\,E\!\left[\int_{-T}^{T}\!\int_{-T}^{T} X(s)X(t)e^{-j2\pi f(s-t)}\,dt\,ds\right]$$


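Next, pass the expected value into the integrand (which is permissible
because the integral converges), and use the fact that wide-sense stationarity
implies $E[X(s)X(t)] = R_{XX}(s - t)$:

$$E\!\left[\int_{-T}^{T}\!\int_{-T}^{T} X(s)X(t)e^{-j2\pi f(s-t)}\,dt\,ds\right] = \int_{-T}^{T}\!\int_{-T}^{T} E[X(s)X(t)]\,e^{-j2\pi f(s-t)}\,dt\,ds$$
$$= \int_{-T}^{T}\!\int_{-T}^{T} R_{XX}(s - t)\,e^{-j2\pi f(s-t)}\,dt\,ds$$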


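Now make the change of variables τ = s − t (i.e., s = t + τ), under which the
region of integration becomes the parallelogram pictured in Fig. 8.6.
Integrating in the order dt dτ yields the sum of two integrals:

$$S_{XX}(f) = \lim_{T\to\infty} \frac{1}{2T}\left[\int_{-2T}^{0}\!\int_{-T}^{\tau+T} R_{XX}(\tau)e^{-j2\pi f\tau}\,dt\,d\tau + \int_{0}^{2T}\!\int_{\tau-T}^{T} R_{XX}(\tau)e^{-j2\pi f\tau}\,dt\,d\tau\right]$$
$$= \lim_{T\to\infty} \frac{1}{2T}\left[\int_{-2T}^{0} R_{XX}(\tau)e^{-j2\pi f\tau}(2T + \tau)\,d\tau + \int_{0}^{2T} R_{XX}(\tau)e^{-j2\pi f\tau}(2T - \tau)\,d\tau\right]$$
$$= \lim_{T\to\infty} \frac{1}{2T}\int_{-2T}^{2T} R_{XX}(\tau)e^{-j2\pi f\tau}(2T - |\tau|)\,d\tau$$
$$= \lim_{T\to\infty} \int_{-2T}^{2T} R_{XX}(\tau)e^{-j2\pi f\tau}\left(1 - \frac{|\tau|}{2T}\right)d\tau$$
$$= \lim_{T\to\infty} \int_{-\infty}^{\infty} R_{XX}(\tau)e^{-j2\pi f\tau}\,q_T(\tau)\,d\tau$$

Fig. 8.6 Region of integration for the proof of the Wiener–Khinchin Theorem (a parallelogram with vertices (−2T, −T), (0, −T), (2T, T), and (0, T); horizontal axis τ)

where $q_T(\tau) = 1 - |\tau|/2T$ for |τ| ≤ 2T and 0 otherwise. Since $q_T(\tau) \to 1$ as T → ∞ for
all τ, we conclude that

$$S_{XX}(f) = \int_{-\infty}^{\infty} R_{XX}(\tau)e^{-j2\pi f\tau}\,d\tau = \mathcal{F}\{R_{XX}(\tau)\},$$

as claimed. ■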


8.1.5 Exercises: Section 8.1 (1-21) 


1. The function rect(τ) satisfies all the properties of an autocorrelation function for
a WSS process that were specified in the main proposition of Sect. 7.3: rect(τ) is
even, has its maximum value at 0, and vanishes as τ → ∞. However, rect(τ) cannot
be the autocorrelation of a WSS random process. Why not? [Hint: Consider the
resulting psd.] This demonstrates that the properties listed in that proposition
do not fully characterize the types of functions that can be autocorrelations.

2. Let A(t) be a WSS random process with autocorrelation function $R_{AA}(\tau)$ and
power spectral density $S_{AA}(f)$. Define an “amplitude modulated” version of
A(t) by

$$X(t) = A(t)\cos(2\pi f_0 t + \Theta),$$

where Θ ~ Unif(−π, π] and is independent of A(t).
(a) Find the mean and autocorrelation functions of X(t).
(b) Find the power spectral density of X(t).
(c) Find an expression for the expected power in X(t).
3. Suppose X(t) is a wide-sense stationary process with the following autocorrelation
function:

$$R_{XX}(\tau) = 250 + 1000\exp(-4 \times 10^6\tau^2)$$

(a) Find and graph the power spectral density of X(t).
(b) Find the ensemble average power in X(t) between 500 Hz and 1 kHz.
(c) Find the ensemble average power in X(t) below 200 Hz.

4. Let A(t) be a wide-sense stationary waveform with autocorrelation function
$R_{AA}(\tau) = 2400\,\mathrm{sinc}(2000\tau)$. Define a new random process X(t) by

$$X(t) = 20 + A(t)\cos(5000\pi t + \Theta)$$

where Θ is uniform on (−π, π] and independent of A(t).
(a) Find the mean function of X(t).
(b) Find the autocorrelation function of X(t). Is X(t) WSS?
(c) Find the expected power in X(t).
(d) Find and sketch the power spectral density of X(t).
(e) Find the expected power in X(t) in the frequency band from 2 to 3 kHz.

5. Suppose X(t) is a WSS random process with power spectral density
$S_{XX}(f) = 0.2\exp(-\pi^2 f^2/10^{12})$.
(a) Sketch the psd, and find the expected power in X(t).
(b) Find the expected power in X(t) above 10 kHz.
(c) Find the autocorrelation function of X(t) and verify your answer to (a).

6. Let X(t) be a WSS random process with mean $\mu_X = 32.6$ and autocovariance
function $C_{XX}(\tau) = 12{,}160\,\mathrm{sinc}^2(40{,}000\tau)$.
(a) Find and sketch the power spectral density of X(t).
(b) Find the expected power in X(t) below 20 kHz.
(c) Find the expected power in X(t) between 10 and 30 kHz.
(d) Find the total expected power in X(t).

7. Let N(t) be lowpass white noise, i.e., N(t) is WSS with power spectral density
given by $S_{NN}(f) = N_0/2$ for |f| ≤ B and 0 otherwise (see Fig. 8.5a).
(a) Find the expected power in N(t).
(b) Find the autocorrelation function of N(t).

8. Let N(t) be bandpass white noise, i.e., N(t) is WSS with power spectral density
given by $S_{NN}(f) = N_0/2$ for $f_0 - B/2 \le |f| \le f_0 + B/2$ and 0 otherwise (see
Fig. 8.5b).
(a) Find the expected power in N(t).
(b) Find the autocorrelation function of N(t).

9. Let N(t) be a Poisson telegraphic process with parameter λ as defined in
Sect. 7.5, and consider $Y(t) = A_0 N(t)$ for some constant $A_0 > 0$.
(a) Find the autocorrelation function of Y(t).
(b) Find and sketch the power spectral density of Y(t).
(c) Find the expected power in Y(t).
(d) What proportion of the expected power in Y(t) lies below the frequency
λ Hz?

10. Let X(t) have power spectral density $S_{XX}(f) = N_0 - |f|/A$ for |f| ≤ B (and zero
otherwise), where $B < N_0 A$.
(a) Find the expected power in X(t).
(b) Find the autocorrelation function of X(t).
[Hint: It may be helpful to sketch $S_{XX}(f)$ first.]


11. Suppose a random process X(t) has autocorrelation function $R_{XX}(\tau)$ =
10007"! +5007" +500
(a) Find the expected power in X(t).
(b) Find and sketch the power spectral density of X(t).
(c) Find the expected power in X(t) below 1 Hz.

12. Let X(t) be a WSS random process, and define a d-second delay of X(t) by
Y(t) = X(t − d). Find the mean, autocorrelation, and power spectrum of Y(t) in
terms of those of X(t).

13. Let X(t) be a WSS random process, and define a d-second “moving window”
process by W(t) = X(t) − X(t − d). Find the mean, autocorrelation, and power
spectrum of W(t) in terms of those of X(t).

14. Let X(t) and Y(t) be jointly WSS random processes. Show that $S_{XY}(f) = S^*_{YX}(f)$.

15. Let X(t) and Y(t) be orthogonal and WSS random processes, and define
Z(t) = X(t) + Y(t).
(a) Are X(t) and Y(t) jointly WSS? Why or why not?
(b) Is Z(t) WSS?
(c) Find the psd of Z(t).

16. Let X(t) and Y(t) be non-orthogonal, jointly WSS random processes, and define
Z(t) = X(t) + Y(t).
(a) Find the autocorrelation function of Z(t). Is Z(t) WSS?
(b) Find the power spectral density of Z(t), and explain why this expression is
real-valued.

17. Let X(t) and Y(t) be independent WSS random processes, and define
Z(t) = X(t)Y(t).
(a) Show that Z(t) is also WSS.
(b) Find the psd of Z(t).

18. Pink noise, also called 1/f noise, is characterized by the power spectrum
$S_{NN}(f) = 1/|f|$ for f ≠ 0.
(a) Explain why such a process is not physically realizable.
(b) Consider a band-limited pink noise process with psd $S_{NN}(f) = 1/|f|$ for
$f_0 \le |f| \le f_1$. Find the expected power of such a random process.
(c) A “generalized pink noise” process has the psd $S_{NN}(f) = N_0/(2|f|^{1+\beta})$ for
$|f| > f_0$ and 0 otherwise, where 0 < β < 1. Find the expected power of such a
random process.

19. Highpass white noise is characterized by the power spectrum $S_{NN}(f) = N_0/2$ for
|f| > B and 0 otherwise. Is highpass white noise a physically realizable process?
Why or why not?

20. The ac power spectral density (ac-psd) of a WSS random process is defined as
the Fourier transform of its autocovariance function:

$$S^{ac}_{XX}(f) = \mathcal{F}\{C_{XX}(\tau)\}$$

(a) By using the relationship between $C_{XX}(\tau)$ and $R_{XX}(\tau)$, develop an equation
relating the psd of a random process to its ac-psd.
(b) Find the ac-psd for the random process of Example 8.3.
(c) Explain why the term “ac power spectral density” is appropriate.

21. Exercise 36 of Chap. 7 presented a random process of the form X(t) = A · Y(t),
where A is a random variable and Y(t) is an ergodic, WSS random process
independent of A. It was shown that X(t) is WSS but not ergodic.
(a) Find the psd of X(t).
(b) Find the ac-psd of X(t). (See the previous exercise.)
(c) Does the ac-psd of X(t) include an impulse at zero? What does this say
about our interpretation of “dc power offset” for non-ergodic processes?


8.2. Random Processes and LTI Systems 


For any communication system to be effective, one must be able to successfully 
distinguish the intended signal from the noise it encounters during transmission. If 
we understand enough about the statistical properties of that noise, then in theory a 
filter can be constructed to minimize noise effects, thereby making the signal easier 
to “hear.” This section gives a very brief overview of filters² and then investigates
aspects of applying a filter to a random, continuous-time signal.

In communication theory, a system refers to anything that operates on a signal.
We will denote a generic system by the letter L. If we let x(t) and y(t) denote the
input and output of this system, respectively, then we may write

$$y(t) = L[x(t)]$$

where L[·] denotes the application of the system to a signal. One particular class of
systems is of the greatest interest, since they form the backbone of filtering.


DEFINITION
A linear, time-invariant (LTI) system L satisfies the following two
properties:

1. (Linearity) For all functions $x_1(t)$ and $x_2(t)$ and all constants $a_1$ and $a_2$,
$$L[a_1x_1(t) + a_2x_2(t)] = a_1L[x_1(t)] + a_2L[x_2(t)]$$

2. (Time invariance) For all d > 0, if y(t) = L[x(t)], then y(t − d) =
L[x(t − d)].

² Readers interested in a thorough treatment of filters and other systems should consult the
reference by Ambardar.

Part 2 of this definition says, in essence, that it does not matter on an absolute
time scale when we apply the LTI system to x(t); the response will be the same,
other than the time delay. As it turns out, an LTI system can be completely
characterized by its effect on an impulse, essentially because a signal can generally
be decomposed into a weighted sum of impulses, and then we may apply linearity.
With this in mind, an LTI system is described in the time domain by its impulse
response (function), denoted h(t):

$$h(t) = L[\delta(t)]$$

It can be shown (see Chap. 6 of the reference by Ambardar) that if L is an LTI
system with impulse response h(t), then the input and output signals of L are related
by a convolution operation:

$$y(t) = x(t) \star h(t) = \int_{-\infty}^{\infty} x(s)h(t - s)\,ds = \int_{-\infty}^{\infty} x(t - s)h(s)\,ds \qquad (8.5)$$

The same relationship holds for random signals, i.e., if X(t) is the random input
to an LTI system and Y(t) the output, then $Y(t) = X(t) \star h(t)$.
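
The convolution integral in Eq. (8.5) can be approximated by a Riemann sum. The Python sketch below is illustrative only: the helper `convolve_at`, the midpoint rule, the unit-step input, and the impulse response $h(t) = ae^{-at}$ for t ≥ 0 with a = 2 are all hypothetical choices for this example. For that particular pair the exact output is $y(t) = 1 - e^{-at}$ for t ≥ 0, which gives something to compare against.

```python
import math

def convolve_at(x, h, t, s_min, s_max, n=20_000):
    """Midpoint-rule approximation of y(t) = integral of x(s) h(t - s) ds.

    The infinite limits of Eq. (8.5) are truncated to [s_min, s_max], which
    must cover the region where the integrand is nonzero.
    """
    ds = (s_max - s_min) / n
    total = 0.0
    for k in range(n):
        s = s_min + (k + 0.5) * ds
        total += x(s) * h(t - s)
    return total * ds

a = 2.0  # hypothetical decay rate of the impulse response (1/seconds)

def step(t):
    """Unit-step input signal."""
    return 1.0 if t >= 0 else 0.0

def h_exp(t):
    """Causal exponential impulse response: a * exp(-a t) for t >= 0."""
    return a * math.exp(-a * t) if t >= 0 else 0.0

t = 1.0
y_numeric = convolve_at(step, h_exp, t, s_min=-1.0, s_max=3.0)
y_exact = 1.0 - math.exp(-a * t)
print(y_numeric, y_exact)
```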

The appearance of a convolution operator suggests it would be desirable to apply
a transform to Eq. (8.5). The Fourier transform of the impulse response, denoted
H(f), is called the transfer function of the LTI system:

$$H(f) = \mathcal{F}\{h(t)\}$$

For deterministic signals, we may then write Y(f) = X(f)H(f), where X(f) and
Y(f) denote the Fourier transforms of x(t) and y(t), respectively. However, Fourier
transforms of random signals do not exist (due to convergence issues), so the
transfer function H(f) cannot be defined as the ratio of the output and input in
the frequency domain as one commonly does in other engineering situations. Still,
the transfer function will prove critical in determining how the power in a random
signal X(t) is “transferred” by an LTI system, as we will see shortly.


8.2.1 Statistical Properties of the LTI System Output 


The following proposition summarizes the relationships between the statistical
properties of the random input signal X(t) of an LTI system and the corresponding
output signal Y(t). Here X(t) is again assumed to be wide-sense stationary.

PROPOSITION

Let L be an LTI system with impulse response h(t) and transfer function H(f).
Suppose X(t) is a wide-sense stationary process and let Y(t) = L[X(t)], the
output of the LTI system applied to X(t). Then X(t) and Y(t) are jointly WSS,
with the following properties.

   Time domain                                                Frequency domain

1. $\mu_Y = \mu_X \int_{-\infty}^{\infty} h(s)\,ds$           1. $\mu_Y = \mu_X \cdot H(0)$
2. $R_{YY}(\tau) = R_{XX}(\tau) \star h(\tau) \star h(-\tau)$ 2. $S_{YY}(f) = S_{XX}(f)\,|H(f)|^2$
3. $P_Y = R_{YY}(0)$                                          3. $P_Y = \int_{-\infty}^{\infty} S_{YY}(f)\,df$
4. $R_{XY}(\tau) = R_{XX}(\tau) \star h(\tau)$                4. $S_{XY}(f) = S_{XX}(f) \cdot H(f)$

The quantity $|H(f)|^2$ in property 2 is called the power transfer function of the
LTI system.


Proof Using the convolution relationship between X(t) and Y(t),

$$Y(t) = X(t) \star h(t) = \int_{-\infty}^{\infty} X(t - s)h(s)\,ds \;\Rightarrow\; E[Y(t)] = \int_{-\infty}^{\infty} E[X(t - s)]\,h(s)\,ds$$

Since X(t) is WSS, the expression E[X(t − s)] is just a constant, $\mu_X$, from which
$E[Y(t)] = \mu_X \int_{-\infty}^{\infty} h(s)\,ds$, as desired. Since this expression does not depend on t, we
deduce that the mean of Y(t) is constant (and we may denote it $\mu_Y$). This establishes
property 1 in the time domain. For the parallel result in the frequency domain,
simply note that since $H(f) = \mathcal{F}\{h(t)\}$, it follows from the definition of the
Fourier transform that $\int_{-\infty}^{\infty} h(s)\,ds = H(0)$.

A similar (but vastly more tedious) derivation yields property 2 in the time
domain (see Exercise 31). The right-hand side establishes that the autocorrelation of
Y(t) depends only on τ and not t, and therefore Y(t) is indeed WSS. Hence, the
Wiener–Khinchin Theorem applies to Y(t), and taking the Fourier transform of both
sides gives

$$\mathcal{F}\{R_{YY}(\tau)\} = \mathcal{F}\{R_{XX}(\tau) \star h(\tau) \star h(-\tau)\} \;\Rightarrow$$
$$S_{YY}(f) = \mathcal{F}\{R_{XX}(\tau)\}\,\mathcal{F}\{h(\tau)\}\,\mathcal{F}\{h(-\tau)\} = S_{XX}(f)\,H(f)\,H^*(f),$$

where $H^*(f)$ denotes the complex conjugate of H(f). Now, recall that for
any complex number z, $z \cdot z^* = |z|^2$. We immediately have $H(f)H^*(f) = |H(f)|^2$,
completing property 2 in the frequency domain.

Both the time and frequency versions of property 3 follow immediately from
Sect. 8.1 and the fact that Y(t) is WSS. The proofs of property 4 in the time and
frequency domain are parallel to those of property 2. ■


The frequency domain properties of the previous proposition are the most
illuminating. Property 1 says the dc offset of X(t), $\mu_X$, is “transferred” to the dc
offset of Y(t) by evaluating the transfer function H(f) at 0. This makes sense, since
the dc offset corresponds to the frequency f = 0. Notice in particular that if $\mu_X = 0$,
necessarily $\mu_Y = 0$; an LTI system cannot introduce a dc offset if none exists in the
input signal.

Property 2 states that the power spectrum of the output of an LTI system is
obtained from the input psd through multiplication by the quantity $|H(f)|^2$, hence
the name “power transfer function.” Similar to the preceding discussion about dc
offset, observe that if X(t) carries no power at some particular frequency f (so that
$S_{XX}(f) = 0$), then $S_{YY}(f)$ will be zero there as well. An LTI system cannot introduce
power to any frequency that did not appear in the input signal.


Example 8.5 One of the simplest filters is an RC circuit, an LTI system whose
impulse response is given by

$$h(t) = \frac{1}{RC}\,e^{-t/RC}\,u(t)$$

where u(t) is the unit step function, equal to 1 for t > 0 and zero otherwise. (The
product RC of the resistance and the capacitance is called the time constant of the
circuit, since its units are seconds. The unit step function makes h(t) equal 0 for
t < 0; engineers call this a causal filter.) Suppose we have such a circuit with time
constant RC and that we model the input to our system as a pure white noise process
with power spectral density $S_{XX}(f) = N_0/2$ W/Hz. Let’s investigate the properties of
the output, Y(t).

First, since white noise has mean zero, it follows that $\mu_Y = 0$ as well (property 1).
Now we need the transfer function of the system:

$$H(f) = \mathcal{F}\{h(t)\} = \frac{1}{RC}\,\mathcal{F}\{e^{-t/RC}u(t)\} = \frac{1}{RC}\cdot\frac{1}{1/RC + j2\pi f} = \frac{1}{1 + j2\pi fRC}$$

Next, we find the psd of Y(t) using property 2:

$$S_{YY}(f) = S_{XX}(f)\,|H(f)|^2 = \frac{N_0}{2}\left|\frac{1}{1 + j2\pi fRC}\right|^2 = \frac{N_0}{2}\cdot\frac{1}{1^2 + (2\pi fRC)^2} = \frac{N_0/2}{1 + (2\pi fRC)^2}$$

Fig. 8.7 Power spectral density of Y(t) in Example 8.5

Figure 8.7 displays this power spectral density. Finally, the ensemble average
power of Y(t) is given by

$$P_Y = \int_{-\infty}^{\infty} S_{YY}(f)\,df = \int_{-\infty}^{\infty} \frac{N_0/2}{1 + (2\pi fRC)^2}\,df = \frac{N_0}{2}\int_{-\infty}^{\infty} \frac{df}{1 + (2\pi fRC)^2} = \frac{N_0}{4RC}$$

where the integral is evaluated by the substitution x = 2πfRC and the fact that the
antiderivative of 1/(1 + x²) is arctan(x).

We find that, even though the input signal had (theoretically) infinite power, the
output Y(t) has finite power, directly proportional to the intensity of the input and
inversely proportional to the time constant of the circuit. (As an exercise, see if you
can verify that the units on the final expression for power are indeed watts.) ■
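
The closed form $P_Y = N_0/(4RC)$ can be checked numerically. The sketch below is an illustration under assumed values: $N_0 = 2 \times 10^{-6}$ W/Hz and RC = 1 ms are hypothetical, as are the truncation frequency F and the helper `simpson`. Because the integral is truncated to [−F, F], the numerical result falls slightly short of the exact value; the shortfall shrinks as F grows.

```python
import math

N0 = 2.0e-6   # hypothetical white noise intensity parameter (W/Hz)
RC = 1.0e-3   # hypothetical time constant (seconds)

def psd_out(f):
    """Output psd of the RC filter driven by white noise (Example 8.5)."""
    return (N0 / 2) / (1 + (2 * math.pi * f * RC) ** 2)

def simpson(g, a, b, n):
    """Composite Simpson's rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = g(a) + g(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * g(a + k * h)
    return total * h / 3

F = 1.0e5                                        # truncation frequency (Hz)
p_numeric = simpson(psd_out, -F, F, n=200_000)   # integral over [-F, F]
p_exact = N0 / (4 * RC)                          # closed form from the example
print(p_numeric, p_exact)
```
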
Example 8.6 An LTI system has an impulse response of $h(t) = t^2e^{-t}u(t)$. The input
to this system is the random process

$$X(t) = V + 500\cos(2 \times 10^6\pi t + \Theta),$$

where V and Θ are independent random variables, Θ is uniformly distributed on
(−π, π], and V has mean 60 and variance 12. It was shown in Exercise 25 of Chap. 7
that X(t) is WSS, with mean $\mu_X = \mu_V = 60$ and autocorrelation function
$R_{XX}(\tau) = 3612 + 125{,}000\cos(2 \times 10^6\pi\tau)$. (Depending on whether we choose to
interpret X(t) as a voltage or current waveform, the units on the mean are either
volts or amperes.) Applying the Wiener–Khinchin Theorem, the psd of X(t) is

$$S_{XX}(f) = \mathcal{F}\{R_{XX}(\tau)\} = \mathcal{F}\{3612 + 125{,}000\cos(2 \times 10^6\pi\tau)\}$$
$$= 3612\,\delta(f) + 62{,}500\,\delta(f - 10^6) + 62{,}500\,\delta(f + 10^6)$$

Since X(t) consists of a (random) dc offset and a periodic component, the power
spectrum of X(t) is composed entirely of impulses. Now let Y(t) denote the output of
the LTI system. To deduce the properties of Y(t) requires the transfer function,
H(f), of the LTI system. Using the table of Fourier transforms in Appendix B,

$$H(f) = \mathcal{F}\{h(t)\} = \mathcal{F}\{t^2e^{-t}u(t)\} = \frac{2!}{(1 + j2\pi f)^3} = \frac{2}{(1 + j2\pi f)^3}$$

According to property 1 of the earlier proposition, the mean of the output signal
Y(t) is given by

$$\mu_Y = \mu_X \cdot H(0) = 60 \cdot \frac{2}{(1 + j2\pi \cdot 0)^3} = 120$$

To find the psd of Y(t), we must first calculate the power transfer function of the
LTI system:

$$|H(f)|^2 = \left|\frac{2}{(1 + j2\pi f)^3}\right|^2 = \frac{2^2}{\left(|1 + j2\pi f|^2\right)^3} = \frac{4}{\left(1 + (2\pi f)^2\right)^3}$$

Since the input power spectrum consists of impulses, so does the output power
spectrum; the coefficients on the impulses are found by evaluating the power
transfer function at the appropriate frequencies:

$$S_{YY}(f) = S_{XX}(f)\,|H(f)|^2$$
$$= 3612\,\delta(f)|H(f)|^2 + 62{,}500\,\delta(f - 10^6)|H(f)|^2 + 62{,}500\,\delta(f + 10^6)|H(f)|^2$$
$$= 3612\,\delta(f)|H(0)|^2 + 62{,}500\,\delta(f - 10^6)|H(10^6)|^2 + 62{,}500\,\delta(f + 10^6)|H(-10^6)|^2$$
$$= 3612\,\delta(f)\cdot\frac{4}{(1 + 0^2)^3} + 62{,}500\,\delta(f - 10^6)\cdot\frac{4}{\left(1 + (2 \times 10^6\pi)^2\right)^3} + 62{,}500\,\delta(f + 10^6)\cdot\frac{4}{\left(1 + (-2 \times 10^6\pi)^2\right)^3}$$
$$= 14{,}448\,\delta(f) + 4 \times 10^{-36}\,\delta(f - 10^6) + 4 \times 10^{-36}\,\delta(f + 10^6)$$

The effect of the LTI system is to “ramp up” the dc power and to effectively
eliminate the power at 1 MHz. In particular, the expected power in the output signal
Y(t) is

$$P_Y = \int_{-\infty}^{\infty} S_{YY}(f)\,df = 14{,}448 + 2(4 \times 10^{-36}) \approx 14.448 \text{ kW},$$

with essentially all of the power coming from the dc component. ■
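
The arithmetic of Example 8.6 is easy to reproduce with complex numbers. The Python sketch below is illustrative; the function names `H` and `power_gain` and the dictionary layout for the impulse weights are choices made here, not notation from the text. Each input impulse weight is multiplied by the power transfer function evaluated at that impulse's frequency.

```python
import math

def H(f):
    """Transfer function of the system in Example 8.6: 2 / (1 + j 2 pi f)^3."""
    return 2 / (1 + 2j * math.pi * f) ** 3

def power_gain(f):
    """Power transfer function |H(f)|^2."""
    return abs(H(f)) ** 2

mu_x = 60.0
# Input psd as impulse weights, keyed by frequency in Hz.
input_impulses = {0.0: 3612.0, 1.0e6: 62_500.0, -1.0e6: 62_500.0}

mu_y = mu_x * H(0.0).real                      # property 1: mean scaled by H(0)
output_impulses = {f: w * power_gain(f) for f, w in input_impulses.items()}
p_y = sum(output_impulses.values())            # property 3: total output power

print(mu_y)
print(output_impulses)
print(p_y)
```

The dc weight comes out as 14,448 and the two weights at ±1 MHz are on the order of $10^{-36}$, in line with the example.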


8.2.2 Ideal Filters 


The goal of a filter is, of course, to eliminate (‘filter out”) whatever noise has 
accumulated during the transmission of a signal. At the same time, we do not want 
our filter to affect the intended signal, lest information be lost. Ideally, we would 
know at what frequencies the noise in our transmission exists, and then a filter 
would be designed that completely eliminates those frequencies while preserving 
all others. (If the frequency band of the noise overlaps that of the signal, one can 
modulate the signal so that the two frequency bands are disjoint.) 


DEFINITION
An LTI system is an ideal filter if there exists some set of frequencies, $F_{\text{pass}}$,
such that the system’s power transfer function is given by

$$|H(f)|^2 = \begin{cases} 1 & \text{for } f \in F_{\text{pass}} \\ 0 & \text{otherwise} \end{cases}$$

If we let X(t) denote the input to the system (which may consist of both signal
and noise) and Y(t) the output, then for an ideal filter we have

$$S_{YY}(f) = S_{XX}(f)\,|H(f)|^2 = \begin{cases} S_{XX}(f) & \text{for } f \in F_{\text{pass}} \\ 0 & \text{otherwise} \end{cases}$$

In other words, the power spectrum of X(t) within the band $F_{\text{pass}}$ is unchanged by
the filter, while everything in X(t) lying outside that band is completely eliminated.
Thus, the obvious goal is to select $F_{\text{pass}}$ to include all frequencies in the signal and
exclude all frequencies in the accumulated noise.

Figure 8.8 displays |H(f)| for four different types of ideal filters. To be consistent
with the two-sided nature of power spectral densities, we present the graphs for
−∞ < f < ∞, even though plots starting at f = 0 are more common in engineering
practice. Figure 8.8a shows a lowpass filter, which preserves the signal up to some
threshold B. Under our notation, $F_{\text{pass}} = [0, B]$ for an ideal lowpass filter. The ideal
highpass filter of Fig. 8.8b does essentially the opposite, preserving frequencies
above B. Figure 8.8c, d illustrate a bandpass filter and a bandstop filter (also
called a notch filter), respectively.

Fig. 8.8 Ideal filters: (a) lowpass; (b) highpass; (c) bandpass; (d) bandstop

The previous section briefly mentioned band-limited white noise processes,
wherein we also used the terms “lowpass” and “bandpass.” These models inherit
their names from the aforementioned filters, e.g., if pure white noise passes through
an ideal bandpass filter, the result is called bandpass white noise.


Example 8.7 A WSS random signal X(t) with autocorrelation function
$R_{XX}(\tau) = 250 + 1500\exp(-1.6 \times 10^9\tau^2)$ is passed through an ideal lowpass filter
with B = 10 kHz (i.e., $10^4$ Hz). Before considering the effect of the filter, let’s
investigate the properties of the input signal X(t). The ensemble average power of
the input is $P_X = R_{XX}(0) = 250 + 1500 = 1750$ W; moreover, we recognize that
250 W represents the dc power offset while the other 1500 W comes from an
aperiodic component. Applying the Wiener–Khinchin Theorem, the input power
spectral density is given by

$$S_{XX}(f) = \mathcal{F}\{R_{XX}(\tau)\} = \mathcal{F}\{250 + 1500\exp(-1.6 \times 10^9\tau^2)\}$$
$$= 250\,\delta(f) + 1500\,\mathcal{F}\{\exp(-1.6 \times 10^9\tau^2)\}$$

The second Fourier transform requires the rescaling property; however, we must
be careful in identifying the rescaling constant. If we rewrite $1.6 \times 10^9\tau^2$ as
$(4 \times 10^4\tau)^2$, we see that the appropriate rescaling constant is actually $a = 4 \times 10^4$.
Continuing,

$$S_{XX}(f) = 250\,\delta(f) + 1500\,\mathcal{F}\{\exp(-(4 \times 10^4\tau)^2)\}$$
$$= 250\,\delta(f) + 1500 \cdot \frac{\sqrt{\pi}}{4 \times 10^4}\exp\!\left(-\pi^2\left(f/(4 \times 10^4)\right)^2\right)$$
$$= 250\,\delta(f) + \frac{3\sqrt{\pi}}{80}\exp\!\left(-\pi^2f^2/(1.6 \times 10^9)\right)$$

Fig. 8.9 Power spectral densities for Example 8.7: (a) input signal; (b) output signal

This psd appears in Fig. 8.9a. Now let’s apply the filter, and as usual let Y(t) denote
the output. Then, based on the preceding discussion, the psd of Y(t) is given by

$$S_{YY}(f) = \begin{cases} S_{XX}(f) & f \in F_{\text{pass}} \\ 0 & \text{otherwise} \end{cases} = \begin{cases} 250\,\delta(f) + \dfrac{3\sqrt{\pi}}{80}\,e^{-\pi^2f^2/(1.6 \times 10^9)} & |f| \le 10^4 \text{ Hz} \\ 0 & \text{otherwise} \end{cases}$$

Figure 8.9b shows the output power spectrum, which is identical to $S_{XX}(f)$ in the
preserved band $[0, 10^4]$ and zero everywhere else.

The ensemble average power of the output signal Y(t) is calculated by taking the
integral of $S_{YY}(f)$, which in this case requires numerical integration by a calculator
or computer:

$$P_Y = \int_{-\infty}^{\infty} S_{YY}(f)\,df = \int_{-10^4}^{10^4}\left[250\,\delta(f) + \frac{3\sqrt{\pi}}{80}\,e^{-\pi^2f^2/(1.6 \times 10^9)}\right]df$$
$$= 250 + 2\int_{0}^{10^4} \frac{3\sqrt{\pi}}{80}\,e^{-\pi^2f^2/(1.6 \times 10^9)}\,df \approx 250 + 1100 = 1350 \text{ W}$$
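
The numerical integration at the end of Example 8.7 can be sketched as follows. This is an illustration, not the method used in the text: the helper `simpson` is a plain composite Simpson's rule, and the cross-check uses the fact that a Gaussian integral over a finite range can be written with the error function, which here gives $1500\,\mathrm{erf}(\pi/4)$ for the continuous part.

```python
import math

def continuous_psd(f):
    """Continuous part of the input psd in Example 8.7."""
    return (3 * math.sqrt(math.pi) / 80) * math.exp(-(math.pi * f) ** 2 / 1.6e9)

def simpson(g, a, b, n=2000):
    """Composite Simpson's rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = g(a) + g(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * g(a + k * h)
    return total * h / 3

B = 1.0e4                                            # lowpass cutoff (Hz)
p_out = 250 + 2 * simpson(continuous_psd, 0.0, B)    # dc once + doubled band
p_closed = 250 + 1500 * math.erf(math.pi / 4)        # erf-based cross-check
print(p_out, p_closed)
```

Both values should be close to the 1350 W reported in the example.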


In the preceding example, the output power from the ideal filter was less than the 
input power (1350 W < 1750 W). It should be clear that this will always be the case: it is 
impossible to achieve a power gain with an ideal filter of any type. At best, if the entire 
input lies within the preserved band F’,,;s, then the input and output power will be equal. 

Fig. 8.10 Power transfer functions for Butterworth filters (approximations to ideal filters)

Of course, in practice one cannot actually construct an "ideal" filter: there is no
engineering system that will perfectly cut off a signal at a prescribed frequency. But
many simple systems can approximate our ideal. For instance, consider Example
8.5: the power transfer function of that RC filter is identical to Fig. 8.7 (except that
the height at f = 0 is 1 rather than $N_0/2$). This bears some weak resemblance to the
picture for an ideal lowpass filter in Fig. 8.8a. In fact, a more general class of LTI
systems called Butterworth filters can achieve an even more "squared off" appearance;
the nth-order Butterworth filter has a power transfer function of the form


$$
|H(f)|^{2} = \frac{\alpha}{1 + (\beta 2\pi f)^{2n}},
$$

where the constants $\alpha$ and $\beta$ can be derived from the underlying circuit. The RC
filter of Example 8.5 is a "first-order" (i.e., n = 1) Butterworth filter. The books by
Peebles and Ambardar listed in the references provide more information. Examples
of these power transfer functions are displayed in Fig. 8.10.
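To see how the order n "squares off" the response, the sketch below evaluates a power transfer function of the form $\alpha/(1 + (\beta 2\pi f)^{2n})$ at three frequencies. The function name, the value of `BETA`, and the reading of the exponent as 2n are assumptions made for illustration; they do not come from a specific circuit.

```python
import math

def butterworth_power_gain(f, order, alpha=1.0, beta=1.0):
    """Power transfer function alpha / (1 + (beta*2*pi*f)^(2*order))."""
    return alpha / (1.0 + (beta * 2.0 * math.pi * f) ** (2 * order))

BETA = 1.0e-3                          # hypothetical circuit constant
F_HALF = 1.0 / (2.0 * math.pi * BETA)  # frequency at which beta*2*pi*f = 1

gains = {}
for order in (1, 2, 4, 8):
    inside = butterworth_power_gain(0.5 * F_HALF, order, beta=BETA)   # inside the band
    edge = butterworth_power_gain(F_HALF, order, beta=BETA)           # half-power point
    outside = butterworth_power_gain(2.0 * F_HALF, order, beta=BETA)  # outside the band
    gains[order] = (inside, edge, outside)
    print(f"n={order}: inside={inside:.4f}, edge={edge:.4f}, outside={outside:.6f}")
```

As n grows, the gain inside the band approaches $\alpha$ and the gain outside approaches 0, while the value at the half-power frequency stays at $\alpha/2$ for every order.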


8.2.3 Signal Plus Noise 


For a variety of physical reasons, it is common in engineering practice to assume
that communication noise is additive, i.e., if our intended signal X(t) experiences
noise N(t) during transmission, then the received transmission (prior to any filtering)
has the form X(t) + N(t). We assume throughout this subsection that X(t) and
N(t) are independent, WSS random processes and that E[N(t)] = 0 (i.e., the noise
component does not contain a dc offset, a standard engineering assumption).*

The mean of the input process is given by $E[X(t)+N(t)] = E[X(t)] + E[N(t)] =
\mu_X + 0 = \mu_X$, the dc offset of the input signal. Computing the autocorrelation of the
input process relies on the assumed independence:

$$
\begin{aligned}
R_{in}(\tau) &= E[(X(t)+N(t))(X(t+\tau)+N(t+\tau))] \\
             &= E[X(t)X(t+\tau)] + E[X(t)]E[N(t+\tau)] + E[N(t)]E[X(t+\tau)] + E[N(t)N(t+\tau)] \\
             &= R_{XX}(\tau) + 0 + 0 + R_{NN}(\tau) \\
             &= R_{XX}(\tau) + R_{NN}(\tau)
\end{aligned}
$$

* Please note: The case of a deterministic signal x(t) must be handled somewhat differently.
Consult the reference by Ambardar for details.


Then, by the Wiener-Khinchin Theorem, the input power spectrum is

$$
S_{in}(f) = \mathcal{F}\{R_{XX}(\tau) + R_{NN}(\tau)\} = S_{XX}(f) + S_{NN}(f)
$$

Now we imagine passing the random process X(t) + N(t) through some LTI
system L (presumably a filter intended to reduce the noise). The foregoing
assumptions make the analysis of the output process quite straightforward. To
start, the linearity property allows us to regard the system output as the sum of
two parts:

$$
L[X(t) + N(t)] = L[X(t)] + L[N(t)]
$$

That is, we may identify L[X(t)] and L[N(t)] as the output signal and output
noise, respectively. These two output processes are also independent and WSS.
Letting H(f) denote the transfer function of the LTI system, the mean of the output
signal and output noise are, respectively,

$$
\mu_{L[X]} = E(L[X(t)]) = \mu_X H(0), \qquad \mu_{L[N]} = E(L[N(t)]) = \mu_N H(0) = 0
$$

The mean of the overall output process is, by linearity, $\mu_X H(0) + 0 = \mu_X H(0)$.
Similarly, the power spectral density of the output process is

$$
S_{out}(f) = S_{in}(f)\,|H(f)|^{2} = S_{XX}(f)\,|H(f)|^{2} + S_{NN}(f)\,|H(f)|^{2};
$$

the two halves of this expression are the psds of the output signal and output noise.
One measure of the quality of the filter (the LTI system) involves comparing the
power signal-to-noise ratio of the input and output:

$$
SNR_{in} = \frac{P_X}{P_N} \quad \text{versus} \quad SNR_{out} = \frac{P_{L[X]}}{P_{L[N]}}
$$

A good filter should achieve a higher $SNR_{out}$ than $SNR_{in}$ by reducing the amount
of noise without losing any of the intended signal.


Example 8.8 Suppose a random signal X(t) incurs additive noise N(t) in transmission.
Assume the signal and noise components are independent and wide-sense
stationary, X(t) has autocorrelation function $R_{XX}(\tau) = 2400 + 45{,}000\,\mathrm{sinc}^{2}(1800\tau)$,
and N(t) has autocorrelation function $R_{NN}(\tau) = 1500e^{-10{,}000|\tau|}$. To filter out the
noise, we pass the input X(t) + N(t) through an ideal lowpass filter with band limit
1800 Hz.

Fig. 8.11 Power spectra for Example 8.8: (a) input signal; (b) input noise; (c) output noise

Our input power signal-to-noise ratio is

$$
SNR_{in} = \frac{P_X}{P_N} = \frac{R_{XX}(0)}{R_{NN}(0)} = \frac{2400 + 45{,}000}{1500} = 31.6
$$


The power spectral density of X(t) is

$$
\begin{aligned}
S_{XX}(f) &= \mathcal{F}\{R_{XX}(\tau)\} = \mathcal{F}\{2400 + 45{,}000\,\mathrm{sinc}^{2}(1800\tau)\} \\
          &= 2400\,\delta(f) + (45{,}000)\,\frac{1}{1800}\,\mathrm{tri}\!\left(\frac{f}{1800}\right)
           = 2400\,\delta(f) + 25\,\mathrm{tri}\!\left(\frac{f}{1800}\right)
\end{aligned}
$$

This psd is displayed in Fig. 8.11a. Notice that the entire power spectrum of the
input signal lies within the band [0 Hz, 1800 Hz], which is precisely the preserved
band of the filter. Therefore, the filter will have no effect on the input signal; in
particular, the input and output signal components will have the same power
spectral density and the same ensemble average power (47.4 kW).

On the other hand, part of the input noise will be removed by the filter. Begin by
finding the psd of the input noise:

$$
S_{NN}(f) = \mathcal{F}\{R_{NN}(\tau)\} = \mathcal{F}\{1500e^{-10{,}000|\tau|}\}
          = 1500\cdot\frac{2(10{,}000)}{(10{,}000)^{2} + (2\pi f)^{2}}
          = \frac{3\times 10^{7}}{10^{8} + (2\pi f)^{2}}
$$


Figure 8.11b shows the psd of the input noise, while in Fig. 8.11c we see the psd
of the output noise L[N(t)] resulting from passing N(t) through the ideal filter.
The average power in the output noise is

$$
P_{L[N]} = 2\int_{0}^{1800} \frac{3\times 10^{7}}{10^{8} + (2\pi f)^{2}}\,df = \cdots = 808.6\ \text{W},
$$

slightly more than half the original (i.e., input) noise power. As a result, the output
power signal-to-noise ratio equals

$$
SNR_{out} = \frac{P_{L[X]}}{P_{L[N]}} = \frac{47{,}400}{808.6} = 58.6
$$

Because the signal and noise power spectra were so similar, it was not possible to
filter out very much noise. Assuming our model for the input noise is correct, one
solution would be to modulate the signal before transmission to a center frequency
in the "tail" of the $S_{NN}(f)$ distribution and then employ a bandpass filter around that
center frequency (see Exercise 30).
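The arithmetic of Example 8.8 can be reproduced in a few lines. This is a sketch under the example's stated model; the function name `lowpass_noise_power` is an illustrative assumption, and the closed form simply uses the antiderivative of $1/(a^{2} + (2\pi f)^{2})$, an arctangent.

```python
import math

SIGNAL_POWER_W = 2400.0 + 45000.0  # R_XX(0): the filter leaves the signal untouched
NOISE_POWER_IN_W = 1500.0          # R_NN(0)

def lowpass_noise_power(band_limit_hz):
    """Closed form of 2 * integral_0^B of 3e7 / (1e8 + (2*pi*f)^2) df."""
    return (3.0e7 / (math.pi * 1.0e4)) * math.atan(2.0 * math.pi * band_limit_hz / 1.0e4)

noise_power_out = lowpass_noise_power(1800.0)
snr_in = SIGNAL_POWER_W / NOISE_POWER_IN_W
snr_out = SIGNAL_POWER_W / noise_power_out

print(f"output noise power: {noise_power_out:.1f} W")
print(f"SNR_in = {snr_in:.1f}, SNR_out = {snr_out:.1f}")
```

Letting the band limit grow without bound recovers the full 1500 W of input noise, which is a quick sanity check on the closed form.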


8.2.4 Exercises: Section 8.2 (22-38) 


22. Let Y(t) be the output process from Example 8.5. Find the autocorrelation
function of Y(t).

23. A WSS current waveform X(t) with power spectral density $S_{XX}(f)$ =
0.02 W/Hz for |f| ≤ 60 kHz is the input to a filter with impulse response
h(t)= 400 u(t). Let Y(t) denote the output current waveform.

(a) Find the autocorrelation function of the input process X(t). [Hint: Draw
$S_{XX}(f)$ first.]

(b) Calculate the ensemble average power in the input process X(t).

(c) Find the transfer function of this filter.

(d) Find and graph the power spectral density of the output process Y(t).

(e) Determine the ensemble average power in the output process Y(t).

24. A Poisson telegraphic process N(t) with parameter λ = 2 (see Sect. 7.5) is the

input to an LTI system with impulse response h(t) = 2e ‘u(t). 

(a) Find the power spectral density of N(t).

(b) Find the transfer function of the LTI system.

(c) Find the power spectral density of the output process Y(t) = L[N(t)].
25. A white noise process X(t) with power spectral density $S_{XX}(f) = N_0/2$ is the
input to an LTI system with impulse response h(t) = 1 for 0 ≤ t ≤ 1 (and
0 otherwise). Let Y(t) denote the output.

(a) Find the mean of Y(t).

(b) Find the transfer function of the LTI system.

(c) Find the power spectral density of Y(t).

(d) Find the expected power of Y(t).

26. The random process $X(t) = A_0\cos(\omega_0 t + \Theta)$, where Θ ~ Unif(−π, π], is the
input to an LTI system with impulse response h(t) = Be *‘u(t). Let Y(t) denote
the output.

(a) Determine the transfer function and power transfer function of this
system.

(b) Find the power spectral density of Y(t).

(c) Determine the expected power in Y(t). How does that compare to X(t)?

27. A WSS random process X(t) with autocorrelation function
$R_{XX}(\tau)$ = 100+25e™ is passed through an LTI system having impulse
response h(t) = te u(t). Let Y(t) denote the output.

(a) Find the power spectral density of X(t).

(b) What is the expected power of X(t)?

(c) Determine the transfer function and power transfer function of this
system.

(d) Find and sketch the power spectral density of Y(t).

(e) What is the expected power of Y(t)?

28. A white noise process X(t) with power spectral density $S_{XX}(f) = N_0/2$ is the
input to an LTI system with impulse response h(t) = e ”'sin(ω₀t)u(t). Let Y(t)
denote the output.

(a) Determine the transfer function of the LTI system.

(b) Find and sketch the power spectral density of Y(t).

29. Suppose X(t) is a white noise process with power spectral density
$S_{XX}(f) = N_0/2$. A filter with transfer function H(f)=e~*"" is applied to this
process; let Y(t) denote the output.

(a) Find the power spectral density of Y(t).

(b) Find the autocorrelation function of Y(t).

(c) Find the expected power of Y(t).

30. Let X(t) be a WSS random process with autocorrelation function
$R_{XX}(\tau) = 45{,}000\,\mathrm{sinc}^{2}(1800\tau)$; this is the signal from Example 8.8 without the
dc offset. Suppose X(t) encounters the noise N(t) described in Example 8.8.
Since both X(t) and N(t) are concentrated at low frequencies, it is desirable to
modulate X(t) and then use an appropriate filter. Consider the following
modulation, performed prior to transmission: $X_{mod}(t) = X(t)\cos(4000\pi t + \Theta)$,
where Θ ~ Unif(−π, π]. The received signal will be $X_{mod}(t) + N(t)$, to which an
ideal bandpass filter on the spectrum of $X_{mod}(t)$ will be applied.

(a) Find the autocorrelation function of $X_{mod}(t)$.

(b) Find the power spectral density of $X_{mod}(t)$.

(c) Based on (b), what would be the optimal frequency band to "pass" through
a filter?
(d) Use the results of Example 8.8 to determine the expected power in L[N(t)],
the filtered noise process.

(e) Compare the input and output power signal-to-noise ratios. How do these
compare to the SNRs in Example 8.8?

31. Let X(t) be the WSS input to an LTI system with impulse response h(t), and let
Y(t) denote the output.

(a) Show that the cross-correlation function $R_{XY}(\tau)$ equals $R_{XX}(\tau) * h(\tau)$ as
stated in the main proposition of this section. [Hint: In the definition of
$R_{XY}(\tau)$, write $Y(t+\tau)$ as a convolution integral. Rearrange, and then make
an appropriate substitution to show that the integrand is equal to
$R_{XX}(\tau - s)\cdot h(s)$.]

(b) Show that the autocorrelation function of Y(t) is given by

$$
R_{YY}(\tau) = R_{XY}(\tau) * h(-\tau) = R_{XX}(\tau) * h(\tau) * h(-\tau)
$$

[Hint: Write Y(t) = X(t) * h(t) in the definition of $R_{YY}(\tau)$. Rearrange, and then
make an appropriate substitution to show that the integrand is equal to
$R_{XY}(\tau - s)\cdot h(-s)$. Then invoke (a).]

32. A T-second moving-average filter has impulse response h(t) = 1/T for 0 ≤ t ≤ T
(and zero otherwise).

(a) Find the transfer function of this filter.

(b) Find the power transfer function of this filter.

(c) Suppose X(t) is a white noise process with power spectral density
$S_{XX}(f) = N_0/2$. If X(t) is passed through this moving-average filter and
Y(t) is the resulting output, find the power spectral density, expected
power, and autocorrelation function of Y(t).

33. Suppose we pass band-limited white noise X(t) with arbitrary parameters $N_0$
and B through a differentiator:

$$
Y(t) = L[X(t)] = \frac{d}{dt}X(t)
$$

The transfer function of the differentiator is known to be $H(f) = j2\pi f$.

(a) Find the power spectral density of Y(t).

(b) Find the autocorrelation function of Y(t).

(c) What is the ensemble average power of the output?

34. A short-term integrator is defined by the input-output relationship

t-T

(a) Find the impulse response of this system.

(b) Find the power spectrum of Y(t) in terms of the power spectrum of X(t).
[Hint: Write the answer to (a) in terms of the rectangular function first.]

35. Let X(t) be WSS, and let Y(t) be the output resulting from the application to X(t)
of an LTI system with impulse response h(t) and transfer function H(f).
Define a new random process as the difference between input and output:
D(t) = X(t) − Y(t).

(a) Find an expression for the autocorrelation function of D(t) in terms of $R_{XX}$
and h.

(b) Determine the power spectral density of D(t), and verify that your answer is
real, symmetric, and nonnegative.

36. An amplitude-modulated waveform can be modeled by the expression
$A(t)\cos(100\pi t + \Theta) + N(t)$, where A(t) is WSS and has autocorrelation
function $R_{AA}(\tau) = 80\,\mathrm{sinc}^{2}(10\tau)$; Θ ~ Unif(−π, π] and is independent of
A(t); and N(t) is band-limited white noise, independent of A(t) and Θ, with
$S_{NN}(f)$ = 0.05 W/Hz for |f| ≤ 100 Hz. To filter out the noise, we pass the
waveform through an ideal bandpass filter with transfer function H(f) = 1 for
40 ≤ |f| ≤ 60.

Let $X(t) = A(t)\cos(100\pi t + \Theta)$, the signal part of the input.

(a) Find the autocorrelation of X(t).

(b) Find the ensemble average power in X(t).

(c) Find and graph the power spectral density of X(t).

(d) Find the ensemble average power in the signal part of the output.

(e) Find the ensemble average power in N(t).

(f) Find the ensemble average power in the noise part of the output.

(g) Find the power signal-to-noise ratio of the input and the power signal-to-noise
ratio of the output. Discuss what you find.

37. A random signal X(t) incurs additive noise N(t) in transmission. The signal and
noise components are independent and WSS, X(t) has autocorrelation function
$R_{XX}(\tau) = 250{,}000 + 120{,}000\cos(70{,}000\pi\tau) + 800{,}000\,\mathrm{sinc}(100{,}000\tau)$, and N(t)
has power spectral density $S_{NN}(f)$ = 2.5x 10~* W/Hz for |f| ≤ 100 kHz. To
filter out the noise, we pass the input X(t) + N(t) through an ideal lowpass filter
with transfer function H(f) = 1 for |f| ≤ 60 kHz.

(a) Find the ensemble average power in X(t).

(b) Find and sketch the power spectral density of X(t).

(c) Find the power spectral density of L[X(t)].

(d) Find the ensemble average power in L[X(t)].

(e) Find the ensemble average power in N(t).

(f) Find the ensemble average power in L[N(t)]. (Think about what the power
spectral density of L[N(t)] will look like.)

(g) Find the power signal-to-noise ratio of the input and the power signal-to-noise
ratio of the output. Discuss what you find.

38. Let X(t) be a pure white noise process with psd $N_0/2$. Consider an LTI system
with impulse response h(t), and let Y(t) denote the output resulting from passing
X(t) through this LTI system.

(a) Show that $R_{XY}(\tau) = \frac{N_0}{2}h(\tau)$.

(b) Show that $P_Y = \frac{N_0}{2}E_h$, where $E_h$ is the energy in the impulse response
function, defined by $E_h = \int_{-\infty}^{\infty} h^{2}(t)\,dt$.
8.3 Discrete-Time Signal Processing


Recall from Sect. 7.4 that a random sequence (i.e., a discrete-time random process)
$X_n$ is said to be wide-sense stationary if (1) its mean, $\mu_X[n]$, is a constant $\mu_X$ and
(2) its autocorrelation function, $R_{XX}[n, n+k]$, depends only on the integer-valued
time difference k (in which case we may denote the autocorrelation $R_{XX}[k]$).
Analogous to the Wiener-Khinchin Theorem, the power spectral density of a
WSS random sequence is given by the discrete-time Fourier transform of its
autocorrelation function:

$$
S_{XX}(F) = \sum_{k=-\infty}^{\infty} R_{XX}[k]\,e^{-j2\pi Fk} \qquad (8.6)
$$

We use parentheses around the argument F in Eq. (8.6) because $S_{XX}(F)$ is a
function on a continuum, even though the random sequence is on a discrete index
set (the integers). Similar to the continuous case, it can be shown that $S_{XX}(F)$ is a
real-valued, nonnegative, symmetric function of F. (The choice of capital F will be
explained toward the end of this section.)

Power spectral densities for random sequences differ from their continuous-time
counterparts in one key respect: the psd of a WSS random sequence is always a
periodic function, with period 1. To see this, recall that $e^{-j2\pi k} = 1$ for any integer k,
and write

$$
\begin{aligned}
S_{XX}(F+1) &= \sum_{k=-\infty}^{\infty} R_{XX}[k]\,e^{-j2\pi(F+1)k}
             = \sum_{k=-\infty}^{\infty} R_{XX}[k]\,e^{-j2\pi Fk}\,e^{-j2\pi k} \\
            &= \sum_{k=-\infty}^{\infty} R_{XX}[k]\,e^{-j2\pi Fk} = S_{XX}(F)
\end{aligned}
$$

As a consequence, we may recover the autocorrelation function of a WSS
random sequence from its power spectrum by taking the inverse Fourier transform
of $S_{XX}(F)$ over an interval of length 1:

$$
R_{XX}[k] = \int_{-1/2}^{1/2} S_{XX}(F)\,e^{j2\pi Fk}\,dF \qquad (8.7)
$$


This affects how we calculate the power in a random sequence from its power
spectral density. Analogous to the continuous-time case, we define the (ensemble)
average power of a WSS random sequence $X_n$ by

$$
P_X = E(X^{2}[n]) = R_{XX}[0] = \int_{-1/2}^{1/2} S_{XX}(F)\,e^{j2\pi F(0)}\,dF = \int_{-1/2}^{1/2} S_{XX}(F)\,dF
$$

That is, the expected power in a random sequence is determined by integrating
its psd over one period, not the entire frequency spectrum.
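The periodicity of the psd and the "integrate over one period" rule can both be checked numerically. The sketch below uses a hypothetical autocorrelation $R_{XX}[k] = 0.5^{|k|}$, chosen only for illustration (it is not taken from any example in the text), truncates the sum in Eq. (8.6) at a finite lag, and integrates with a simple midpoint rule.

```python
import cmath
import math

def autocorr(k, a=0.5):
    """Hypothetical WSS autocorrelation R_XX[k] = a^|k| (illustration only)."""
    return a ** abs(k)

def psd(F, max_lag=60):
    """Truncated Eq. (8.6): sum of R_XX[k] * exp(-j*2*pi*F*k) over |k| <= max_lag."""
    return sum(autocorr(k) * cmath.exp(-2j * math.pi * F * k)
               for k in range(-max_lag, max_lag + 1))

def power_from_psd(num_points=1000):
    """Midpoint-rule integral of the psd over one period, -1/2 <= F <= 1/2."""
    step = 1.0 / num_points
    return sum(psd(-0.5 + (i + 0.5) * step).real for i in range(num_points)) * step

F0 = 0.2
s_at_f0 = psd(F0)
s_shifted = psd(F0 + 1.0)
power = power_from_psd()

print("S(F0)     =", s_at_f0)
print("S(F0 + 1) =", s_shifted)
print("integral over one period =", power, " R_XX[0] =", autocorr(0))
```

The two psd values agree up to rounding, the imaginary part is negligible (the psd is real because the autocorrelation is symmetric in k), and the integral over one period returns $R_{XX}[0]$.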


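Example 8.9 Consider the Bernoulli sequence of Sect. 7.4: the $X_n$ are iid Bernoulli
rvs, a stationary sequence with $\mu_X = p$, $C_{XX}[0] = \mathrm{Var}(X_n) = p(1-p)$, and $C_{XX}[k] = 0$
for k ≠ 0. From these, the autocorrelation function is

$$
R_{XX}[k] = C_{XX}[k] + \mu_X^{2} =
\begin{cases}
p & k = 0 \\
p^{2} & k \ne 0
\end{cases}
$$

In particular, $P_X = R_{XX}[0] = p$. To determine the power spectral density, apply
Eq. (8.6):

$$
\begin{aligned}
S_{XX}(F) &= \sum_{k=-\infty}^{\infty} R_{XX}[k]\,e^{-j2\pi Fk}
           = R_{XX}[0]\,e^{-j2\pi F(0)} + \sum_{k\ne 0} R_{XX}[k]\,e^{-j2\pi Fk} \\
          &= p + p^{2}\sum_{k\ne 0} e^{-j2\pi Fk}
           = p + p^{2}\sum_{k=-\infty}^{\infty} e^{-j2\pi Fk} - p^{2}e^{-j2\pi F(0)} \\
          &= p(1-p) + p^{2}\sum_{k=-\infty}^{\infty} e^{-j2\pi Fk}
\end{aligned}
$$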


Engineers will recognize this last summation as an impulse train (sometimes
called a sampling function or Dirac comb), from which we have

$$
S_{XX}(F) = p(1-p) + p^{2}\sum_{n=-\infty}^{\infty} \delta(F - n)
$$

A graph of this periodic function appears in Fig. 8.12; notice it is indeed a
nonnegative, symmetric, periodic function with period 1. Since it's sufficient to
define the psd of a WSS random sequence on the interval (−1/2, 1/2), we could drop
all but one of the impulses and write $S_{XX}(F) = p(1-p) + p^{2}\delta(F)$ for −1/2 < F < 1/2.

For a more general iid sequence with $E[X_n] = \mu_X$ and $\mathrm{Var}(X_n) = \sigma_X^{2}$, a similar
derivation shows that $S_{XX}(F) = \sigma_X^{2} + \mu_X^{2}\sum_{n=-\infty}^{\infty}\delta(F - n)$, or $\sigma_X^{2} + \mu_X^{2}\delta(F)$ for
−1/2 < F < 1/2. In particular, if $X_n$ is a mean-zero iid sequence, the psd of $X_n$ is
just the constant $\sigma_X^{2}$.


Fig. 8.12 Power spectral density of a Bernoulli sequence (Example 8.9)
Example 8.10 Suppose $X_n$ is a WSS random sequence with power spectral density
$S_{XX}(F) = \mathrm{tri}(2F)$ for −1/2 < F < 1/2. Let's determine the autocorrelation function
of $X_n$.

The psd may be rewritten as $S_{XX}(F) = 1 - 2|F|$ for −1/2 < F < 1/2, which is
shown in Fig. 8.13a. Apply Eq. (8.7):

$$
\begin{aligned}
R_{XX}[k] &= \int_{-1/2}^{1/2} S_{XX}(F)\,e^{j2\pi Fk}\,dF = \int_{-1/2}^{1/2} (1 - 2|F|)\,e^{j2\pi Fk}\,dF \\
          &= \int_{-1/2}^{1/2} (1 - 2|F|)\cos(2\pi Fk)\,dF \quad \text{(since } 1 - 2|F| \text{ is even)} \\
          &= 2\int_{0}^{1/2} (1 - 2F)\cos(2\pi Fk)\,dF \quad \text{(since the integrand is even)}
\end{aligned}
$$

For k = 0, this is a simple polynomial integral resulting in $R_{XX}[0] = 1/2$,
which equals the area under $S_{XX}(F)$, as required. For k ≠ 0, integration by
parts yields

$$
R_{XX}[k] = \frac{1 - \cos(\pi k)}{\pi^{2}k^{2}} =
\begin{cases}
2/(\pi^{2}k^{2}) & k \text{ odd} \\
0 & k \text{ even}
\end{cases}
$$


The graph of this autocorrelation function appears in Fig. 8.13b. 
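The result of Example 8.10 can be confirmed numerically. The following sketch (helper names are illustrative assumptions) evaluates the integral in Eq. (8.7) for the triangular psd with a midpoint rule and compares it with the closed form $(1 - \cos(\pi k))/(\pi^{2}k^{2})$.

```python
import math

def psd(F):
    """Triangular psd of Example 8.10 on -1/2 < F < 1/2."""
    return 1.0 - 2.0 * abs(F)

def autocorr_numeric(k, num_points=2000):
    """Midpoint rule for the integral of S(F)*cos(2*pi*F*k) over [-1/2, 1/2]."""
    step = 1.0 / num_points
    total = 0.0
    for i in range(num_points):
        F = -0.5 + (i + 0.5) * step
        total += psd(F) * math.cos(2.0 * math.pi * F * k)
    return total * step

def autocorr_closed_form(k):
    """R_XX[0] = 1/2; otherwise (1 - cos(pi*k)) / (pi^2 * k^2)."""
    if k == 0:
        return 0.5
    return (1.0 - math.cos(math.pi * k)) / (math.pi ** 2 * k ** 2)

for k in range(0, 7):
    print(k, round(autocorr_numeric(k), 6), round(autocorr_closed_form(k), 6))
```

Only the cosine part of the complex exponential is integrated, because the sine part vanishes for an even psd; the even lags (other than k = 0) come out as zero, as the closed form predicts.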


Fig. 8.13 Graphs for Example 8.10: (a) Power spectral density; (b) autocorrelation function


8.3.1 Random Sequences and LTI Systems 


A discrete-time LTI system L has similar properties to those described in the
previous section for continuous time. If we let δ[n] denote the Kronecker delta
function (i.e., δ[0] = 1 and δ[n] = 0 for n ≠ 0), then a discrete-time LTI system is
characterized by an impulse response* function h[n] defined by h[n] = L[δ[n]]. If
we let $X_n$ denote the input to the LTI system and $Y_n$ the output, so that $Y_n = L[X_n]$,
then $Y_n$ may be computed through discrete-time convolution:

$$
Y_n = X_n * h[n] = \sum_{k=-\infty}^{\infty} X_k\,h[n-k] = \sum_{k=-\infty}^{\infty} X_{n-k}\,h[k]
$$

Discrete-time LTI systems can be characterized in the frequency domain by a
transfer function H(F), defined as the discrete-time Fourier transform of the
impulse response:

$$
H(F) = \sum_{n=-\infty}^{\infty} h[n]\,e^{-j2\pi Fn}
$$


This transfer function, like the power spectral density, is periodic in F with
period 1. The properties of the output sequence $Y_n$ are similar to those for Y(t) in the
continuous-time case.


PROPOSITION

Let L be an LTI system with impulse response h[n] and transfer function
H(F). Suppose $X_n$ is a wide-sense stationary sequence and let $Y_n = L[X_n]$, the
output of the LTI system applied to $X_n$. Then $Y_n$ is also WSS, with the
following properties.

    Time domain                                        Frequency domain

1.  $\mu_Y = \mu_X \sum_{n=-\infty}^{\infty} h[n]$                1.  $\mu_Y = \mu_X \cdot H(0)$
2.  $R_{YY}[k] = R_{XX}[k] * h[k] * h[-k]$             2.  $S_{YY}(F) = S_{XX}(F)\cdot|H(F)|^{2}$
3.  $P_Y = R_{YY}[0]$                                  3.  $P_Y = \int_{-1/2}^{1/2} S_{YY}(F)\,dF$


* In this context, the Kronecker delta function is also commonly called the unit sample response,
since it is strictly speaking not an impulse (its value is well defined at zero). It does, however, share
the two key properties of a traditional Dirac delta function (i.e., an impulse): it equals zero for all
non-zero inputs, and the sum across its entire domain equals 1.


Example 8.11 A moving average operator can be used to "smooth out" a noisy
sequence. The simplest moving average takes the mean of two successive
terms: $Y_n = (X_{n-1} + X_n)/2$. This formula is equivalent to passing the sequence $X_n$
through an LTI system with an impulse response given by h[0] = h[1] = 1/2 and
h[n] = 0 otherwise. The transfer function of this LTI system is

$$
H(F) = \sum_{n=-\infty}^{\infty} h[n]\,e^{-j2\pi Fn} = \frac{1 + e^{-j2\pi F}}{2}
$$


from which the power transfer function is

$$
|H(F)|^{2} = \left|\frac{1 + e^{-j2\pi F}}{2}\right|^{2}
           = \left|\frac{1 + \cos(2\pi F) - j\sin(2\pi F)}{2}\right|^{2}
           = \frac{(1 + \cos(2\pi F))^{2} + \sin^{2}(2\pi F)}{4}
           = \frac{1 + \cos(2\pi F)}{2}
$$

Notice that the function $(1 + \cos(2\pi F))/2$ is periodic with period 1, as required.
Suppose $X_n$ is a WSS random sequence with power spectral density $S_{XX}(F) = N_0$
for |F| ≤ 1/2, as depicted in Fig. 8.14a. Then the moving average $Y_n$ has psd equal to

$$
S_{YY}(F) = S_{XX}(F)\cdot|H(F)|^{2} = N_0\cdot\frac{1 + \cos(2\pi F)}{2}
$$


The graph of this power spectral density appears in Fig. 8.14b. The ensemble
average power in $Y_n$ can be determined by integrating this function from −1/2 to
1/2:

$$
P_Y = \int_{-1/2}^{1/2} S_{YY}(F)\,dF = \frac{N_0}{2}\int_{-1/2}^{1/2} \left[1 + \cos(2\pi F)\right]dF = \frac{N_0}{2}(1) = \frac{N_0}{2}
$$

Fig. 8.14 Power spectral density of the moving average in Example 8.11: (a) $S_{XX}(F)$; (b) $S_{YY}(F)$
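The two-term moving average of Example 8.11 is small enough to work through in code. The sketch below is illustrative only: the helper names, the sample input values, and the choice `N0 = 2.0` are assumptions. It applies the discrete convolution, evaluates the transfer function from its definition, and checks the output power both in the frequency domain and in the time domain (for a flat input psd $N_0$, $R_{XX}[k] = N_0\delta[k]$, so $P_Y = N_0\sum_n h[n]^{2}$).

```python
import cmath
import math

IMPULSE_RESPONSE = [0.5, 0.5]  # h[0], h[1]; h[n] = 0 otherwise

def lti_output(x, h):
    """Discrete convolution y[n] = sum_k h[k] * x[n-k], kept only where every term exists."""
    m = len(h)
    return [sum(h[k] * x[n - k] for k in range(m)) for n in range(m - 1, len(x))]

def transfer_function(F, h):
    """H(F) = sum_n h[n] * exp(-j*2*pi*F*n)."""
    return sum(h[n] * cmath.exp(-2j * math.pi * F * n) for n in range(len(h)))

def power_transfer(F, h):
    return abs(transfer_function(F, h)) ** 2

def output_power(n0, h, num_points=1000):
    """Midpoint-rule integral of N0 * |H(F)|^2 over one period."""
    step = 1.0 / num_points
    return sum(n0 * power_transfer(-0.5 + (i + 0.5) * step, h)
               for i in range(num_points)) * step

smoothed = lti_output([1.0, 3.0, 5.0, 2.0], IMPULSE_RESPONSE)  # averages of adjacent terms
N0 = 2.0  # hypothetical flat psd level of the input
freq_domain_power = output_power(N0, IMPULSE_RESPONSE)
time_domain_power = N0 * sum(value ** 2 for value in IMPULSE_RESPONSE)

print(smoothed)
print(power_transfer(0.1, IMPULSE_RESPONSE), (1 + math.cos(2 * math.pi * 0.1)) / 2)
print(freq_domain_power, time_domain_power, N0 / 2)
```

Because H(0) = 1 for this filter, the mean of the output equals the mean of the input, which is consistent with the first line of the proposition.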
8.3.2 Random Sequences and Sampling


Modern electronic systems often work with digitized signals: analog signals that
have been "sampled" at regular intervals to create a digital (i.e., discrete-time)
signal. Suppose we have a continuous-time (analog) signal X(t), which we sample
every $T_s$ seconds; $T_s$ is called the sampling interval. That is, we only observe X(t) at
times 0, $\pm T_s$, $\pm 2T_s$, and so on. Then we can regard our observed (digital) signal as a
random sequence X[n] defined by

$$
X[n] = X(nT_s) \quad \text{for } n = \ldots, -2, -1, 0, 1, 2, \ldots
$$


This is illustrated for a sample function in Fig. 8.15. 

The following proposition ensures that the sampled version of a WSS random 
process is also WSS—and, hence, that the spectral density theory presented in this 
chapter applies. 


PROPOSITION 
Let X(t) be a WSS random process, and for some fixed $T_s > 0$ define
$X[n] = X(nT_s)$. Then the random sequence X[n] is a WSS random sequence.


The proof was requested in Exercise 45 of Chap. 7. 

If the sampling interval is selected judiciously, then we may (in some sense)
recover the original signal from the digitized version. This relies on a key result
from communication theory called the Nyquist sampling theorem for deterministic
signals: If a signal x(t) has no frequencies above B Hz, then x(t) is completely
determined by its sample values $x[n] = x(nT_s)$ so long as

$$
f_s = \frac{1}{T_s} \ge 2B
$$

The quantity $f_s$ is called the sampling rate. The Nyquist sampling theorem says
that a band-limited signal (with band limit B) can be completely recovered from its
digital version, provided the sampling rate is at least 2B. For example, a signal with
band limit B = 1 kHz = 1000 Hz must be sampled at least 2,000 times per second;
equivalently, the sampling interval $T_s$ can be at most 1/(2B) = .0005 s. The minimum
sampling rate, 2B, is sometimes called the Nyquist rate of that signal.

When $T_s \le 1/(2B)$, as required by the Nyquist sampling theorem, the original
deterministic signal x(t) may be reconstructed by the interpolation formula

$$
x(t) = \sum_{n=-\infty}^{\infty} x(nT_s)\,\mathrm{sinc}\!\left(\frac{t - nT_s}{T_s}\right) \qquad (8.8)
$$

The heart of the Nyquist sampling theorem is the statement that the two sides of
Eq. (8.8) are equal.
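A truncated version of the interpolation formula is easy to try out. The sketch below makes several assumptions that are not part of the text: it takes $\mathrm{sinc}(x) = \sin(\pi x)/(\pi x)$, uses the test signal $x(t) = \mathrm{sinc}^{2}(t)$ (band limit B = 1 Hz, since its transform is a triangle on [−1, 1]), samples at $f_s = 4$ Hz, and keeps only the terms with |n| ≤ 200, so the reconstruction is approximate rather than exact.

```python
import math

def sinc(x):
    """Normalized sinc: sin(pi*x)/(pi*x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def signal(t):
    """Test signal x(t) = sinc^2(t), band-limited to B = 1 Hz."""
    return sinc(t) ** 2

def reconstruct(t, sampling_interval, n_max=200):
    """Truncated interpolation sum: x(n*Ts) * sinc((t - n*Ts)/Ts) for |n| <= n_max."""
    return sum(
        signal(n * sampling_interval) * sinc((t - n * sampling_interval) / sampling_interval)
        for n in range(-n_max, n_max + 1)
    )

SAMPLING_INTERVAL = 0.25  # Ts = 1/fs with fs = 4 Hz >= 2B = 2 Hz

errors = {}
for t in (0.1, 0.37, 1.3):
    errors[t] = abs(reconstruct(t, SAMPLING_INTERVAL) - signal(t))
    print(f"t={t}: x(t)={signal(t):.6f}, reconstruction error={errors[t]:.2e}")
```

At a sample instant the sum collapses to a single term, because the sinc kernel is 1 at its own sample and 0 at every other sample; between samples the small residual error comes from truncating the infinite sum.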
Fig. 8.15 A smooth signal x(t) and its sampled version x[n] (indicated by asterisks)


For a band-limited random process X(t) with corresponding digital sequence
$X[n] = X(nT_s)$, we may define a Nyquist interpolation of X(t) by

$$
X_{Nyq}(t) = \sum_{n=-\infty}^{\infty} X[n]\,\mathrm{sinc}\!\left(\frac{t - nT_s}{T_s}\right)
$$

It can be shown that $X_{Nyq}(t)$ equals the original X(t) in the "mean square sense,"
i.e., that

$$
E\!\left[(X_{Nyq}(t) - X(t))^{2}\right] = 0
$$

(This is slightly weaker than saying $X_{Nyq}(t) = X(t)$; in particular, there may exist
a negligible set of sample functions for which the two differ.)

There is a direct connection between the Nyquist sampling rate and the argument
F of the discrete-time Fourier transform. Suppose a random process X(t) has band
limit B, i.e., the set of frequencies f represented in the spectrum of X(t) satisfies
−B ≤ f ≤ B. Provided we use a sampling rate, $f_s$, at least as great as the Nyquist rate
2B, we have:

$$
-B \le f \le B,\ f_s \ge 2B \ \Rightarrow\ -\frac{1}{2} \le \frac{f}{f_s} \le \frac{1}{2}
$$

If we define $F = f/f_s$, we have a unitless variable whose set of possible values
exactly corresponds to that of F in the discrete-time Fourier transform. Said
differently, F in the discrete-time Fourier transform represents a normalized frequency;
we can recover the spectrum of X(t) across its original frequency band by
writing $f = F\cdot f_s$. (In some textbooks, you will see the argument of the discrete-time
Fourier transform denoted Ω, to indicate radian measure. The variables F and Ω are,
of course, related by Ω = 2πF.)
8.3.3 Exercises: Section 8.3 (39-50)


39. Let X[n] be a WSS random sequence. Show that the power spectral density of
X[n] may be rewritten as

$$
S_{XX}(F) = R_{XX}[0] + 2\sum_{k=1}^{\infty} R_{XX}[k]\cos(2\pi kF)
$$

40. Let X(t) be a WSS random process, and let $X[n] = X(nT_s)$, the sampled version
of X(t). Find the power spectral density of X[n] in terms of the psd of X(t).

41. Suppose X[n] is a WSS random sequence with autocorrelation function
$R_{XX}[k] = \alpha^{|k|}$ for some constant |α| < 1. Find the power spectral density of
X[n]. Sketch this psd for α = −.5, 0, and .5.

42. Consider the correlated bit noise sequence described in Exercise 50 of
Chap. 7: $X_0$ is 0 or 1 with probability .5 each and, for n ≥ 1, $X_n = X_{n-1}$ with
probability .9 and $1 - X_{n-1}$ with probability .1. It was shown in that exercise
that $X_n$ is a WSS random sequence with mean $\mu_X = .5$ and autocorrelation
function

$$
R_{XX}[k] = \frac{1 + .8^{|k|}}{4}
$$

(This particular random sequence can be "time reversed" so that $X_n$ is defined
for negative indices as well.) Find the power spectral density of this correlated
bit noise sequence.

43. A Poisson telegraphic process N(t) with parameter λ = 1 (see Sect. 7.5) is
sampled every 5 s, resulting in the random sequence X[n] = N(5n). Find the
power spectral density of X[n].

44. Discrete-time white noise is a WSS, mean-zero process such that $X_n$ and $X_m$
are uncorrelated for all n ≠ m.

(a) Show that the autocorrelation function of discrete-time white noise is
$R_{XX}[k] = \sigma^{2}\delta[k]$ for some constant σ > 0, where δ[k] is the Kronecker
delta function.

(b) Find the power spectral density of discrete-time white noise. Is it what
you'd expect?

45. Suppose $X_n$ is a WSS random sequence with the following autocorrelation
function:


1 k=0

: k odd
Rul =4 xe *°

0 otherwise


Determine the power spectral density of $X_n$. [Hint: Use Example 8.10.]

46. Let $X_n$ be the WSS input to a discrete-time LTI system with impulse response
h[n], and let $Y_n$ be the output. Define the cross-correlation of $X_n$ and $Y_n$ by
$R_{XY}[n, n+k] = E[X_n Y_{n+k}]$.


(a) Show that $R_{XY}$ does not depend on n, and that $R_{XY} = R_{XX} * h$, where *
denotes discrete-time convolution. (This is the discrete-time version of a
result from the previous section.)

(b) The cross-power spectral density $S_{XY}(F)$ of two jointly WSS random
sequences $X_n$ and $Y_n$ is defined as the discrete-time Fourier transform of
$R_{XY}[k]$. In the present context, show that $S_{XY}(F) = S_{XX}(F)H(F)$, where
H denotes the transfer function of the LTI system.

47. The WSS random sequence $X_n$ has power spectral density $S_{XX}(F) = 2P$ for
|F| ≤ 1/4 and 0 for 1/4 < |F| ≤ 1/2.

(a) Verify that the ensemble average power in $X_n$ is P.

(b) Find the autocorrelation function of $X_n$.

48. Let $X_n$ have power spectral density $S_{XX}(F)$, and suppose $X_n$ is passed through a
discrete-time LTI system with impulse response $h[n] = \alpha^{n}$ for n = 0, 1, 2, ... for
some constant |α| < 1 (and h[n] = 0 otherwise). Let $Y_n$ denote the output
sequence.

(a) Find the mean of $Y_n$ in terms of the mean of $X_n$.

(b) Find the power spectral density of $Y_n$ in terms of the psd of $X_n$.

49. The system in Example 8.11 can be extended to an M-term simple moving
average filter, with impulse response

$$
h[n] =
\begin{cases}
1/M & n = 0, 1, \ldots, M-1 \\
0 & \text{otherwise}
\end{cases}
$$

Let $X_n$ be the WSS input to such a filter, and let $Y_n$ be the output.

(a) Write an expression for $Y_n$ in terms of the $X_n$.

(b) Determine the transfer function of this filter.

(c) Assuming $X_n$ is a discrete-time white noise process (see Exercise 44),
determine the autocorrelation function of $Y_n$.

50. A more general moving average process has the form

$$
Y[n] = \theta_0 X[n] + \theta_1 X[n-1] + \cdots + \theta_M X[n-M]
$$

for some integer M and constants $\theta_0, \ldots, \theta_M$. Let the input sequence X[n] be iid,
with mean 0 and variance $\sigma^{2}$.

(a) Find the impulse response h[n] of the LTI system that produces Y[n] from
X[n].

(b) Find the transfer function of this system.

(c) Find the mean of Y[n].

(d) Find the variance of Y[n].

(e) Find the autocovariance function of Y[n].
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A.1 Binomial cdf 


Table A.1 Cumulative binomial probabilities B(x; n, p) = 3 b(y3n, p) 
y=0 


(a)n=5 
Pp 
0.05 0.10 0.20 0.25 0.30 0.40 0.50 0.60 0.70 0.75 0.80 0.90 0.95 
0 774 590 .328 .237 168 .078 .031 010 002.001 .000 .000 = .000 
1 977 919 737 633 528 337 188 087 031  .016 .007  .000 ~ .000 
x 2 999 991 942 896 .837 683 500 317 163.104 =.058 = 009.001 
3 1.000 1.000 993 984 969 913 812 663 472 (367) .263 «081.023 
4 1.000 1.000 1.000 999 998 .990 969 .922 832  .763 .672 410 ~ .226 
(b) n= 10 Pp 
0.05 0.10 0.20 0.25 0.30 0.40 0.50 0.60 0.70 0.75 0.80 0.90 0.95 
0 599 349 107 056 028 006 001 .000 .000  .000 .000 = .000 = .000 
1 914 .736 376 244 149 046 O11 .002 000 .000 .000 .000 ~=.000 
2 988 .930 678 526 .383 167 055 012 002 .000 = .000 .000 ~=.000 
3 999 987 879 776 650 382 172 055 011.004 =.001_ = .000~—-.000 
4 1.000 998 .967 922 850 633 SUT .166 047, 020 =.006 =.000_ ~—.000 
x 5 1.000 1.000 994 980 953 834 623 367 150.078 = 033, .002,— 000 
6 1.000 1.000 999 996 989 945 828 618 350.224 =6.121_ = .013— «001 
7 1.000 1.000 1.000 — 1.000 998 988 945 833 617 474 322) 070.012 
8 1.000 1.000 1.000 1.000 1.000 998 989 954 851.756 =.624— 264 — 086 
9 1.000 1.000 1.000 1.000 1.000 — 1.000 999 994 972 =.944 = 893 65101 
(c) n = 15
                                             p
 x   0.05   0.10   0.20   0.25   0.30   0.40   0.50   0.60   0.70   0.75   0.80   0.90   0.95
 0   .463   .206   .035   .013   .005   .000   .000   .000   .000   .000   .000   .000   .000
 1   .829   .549   .167   .080   .035   .005   .000   .000   .000   .000   .000   .000   .000
 2   .964   .816   .398   .236   .127   .027   .004   .000   .000   .000   .000   .000   .000
 3   .995   .944   .648   .461   .297   .091   .018   .002   .000   .000   .000   .000   .000
 4   .999   .987   .836   .686   .515   .217   .059   .009   .001   .000   .000   .000   .000
 5  1.000   .998   .939   .852   .722   .402   .151   .034   .004   .001   .000   .000   .000
 6  1.000  1.000   .982   .943   .869   .610   .304   .095   .015   .004   .001   .000   .000
 7  1.000  1.000   .996   .983   .950   .787   .500   .213   .050   .017   .004   .000   .000
 8  1.000  1.000   .999   .996   .985   .905   .696   .390   .131   .057   .018   .000   .000
 9  1.000  1.000  1.000   .999   .996   .966   .849   .597   .278   .148   .061   .002   .000
10  1.000  1.000  1.000  1.000   .999   .991   .941   .783   .485   .314   .164   .013   .001
11  1.000  1.000  1.000  1.000  1.000   .998   .982   .909   .703   .539   .352   .056   .005
12  1.000  1.000  1.000  1.000  1.000  1.000   .996   .973   .873   .764   .602   .184   .036
13  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .995   .965   .920   .833   .451   .171
14  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .995   .987   .965   .794   .537


(d) n = 20
                                             p
 x   0.05   0.10   0.20   0.25   0.30   0.40   0.50   0.60   0.70   0.75   0.80   0.90   0.95
 0   .358   .122   .012   .003   .001   .000   .000   .000   .000   .000   .000   .000   .000
 1   .736   .392   .069   .024   .008   .001   .000   .000   .000   .000   .000   .000   .000
 2   .925   .677   .206   .091   .035   .004   .000   .000   .000   .000   .000   .000   .000
 3   .984   .867   .411   .225   .107   .016   .001   .000   .000   .000   .000   .000   .000
 4   .997   .957   .630   .415   .238   .051   .006   .000   .000   .000   .000   .000   .000
 5  1.000   .989   .804   .617   .416   .126   .021   .002   .000   .000   .000   .000   .000
 6  1.000   .998   .913   .786   .608   .250   .058   .006   .000   .000   .000   .000   .000
 7  1.000  1.000   .968   .898   .772   .416   .132   .021   .001   .000   .000   .000   .000
 8  1.000  1.000   .990   .959   .887   .596   .252   .057   .005   .001   .000   .000   .000
 9  1.000  1.000   .997   .986   .952   .755   .412   .128   .017   .004   .001   .000   .000
10  1.000  1.000   .999   .996   .983   .872   .588   .245   .048   .014   .003   .000   .000
11  1.000  1.000  1.000   .999   .995   .943   .748   .404   .113   .041   .010   .000   .000
12  1.000  1.000  1.000  1.000   .999   .979   .868   .584   .228   .102   .032   .000   .000
13  1.000  1.000  1.000  1.000  1.000   .994   .942   .750   .392   .214   .087   .002   .000
14  1.000  1.000  1.000  1.000  1.000   .998   .979   .874   .584   .383   .196   .011   .000
15  1.000  1.000  1.000  1.000  1.000  1.000   .994   .949   .762   .585   .370   .043   .003
16  1.000  1.000  1.000  1.000  1.000  1.000   .999   .984   .893   .775   .589   .133   .016
17  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .996   .965   .909   .794   .323   .075
18  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .999   .992   .976   .931   .608   .264
19  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .999   .997   .988   .878   .642

(e) n = 25
                                             p
 x   0.05   0.10   0.20   0.25   0.30   0.40   0.50   0.60   0.70   0.75   0.80   0.90   0.95
 0   .277   .072   .004   .001   .000   .000   .000   .000   .000   .000   .000   .000   .000
 1   .642   .271   .027   .007   .002   .000   .000   .000   .000   .000   .000   .000   .000
 2   .873   .537   .098   .032   .009   .000   .000   .000   .000   .000   .000   .000   .000
 3   .966   .764   .234   .096   .033   .002   .000   .000   .000   .000   .000   .000   .000
 4   .993   .902   .421   .214   .090   .009   .000   .000   .000   .000   .000   .000   .000
 5   .999   .967   .617   .378   .193   .029   .002   .000   .000   .000   .000   .000   .000
 6  1.000   .991   .780   .561   .341   .074   .007   .000   .000   .000   .000   .000   .000
 7  1.000   .998   .891   .727   .512   .154   .022   .001   .000   .000   .000   .000   .000
 8  1.000  1.000   .953   .851   .677   .274   .054   .004   .000   .000   .000   .000   .000
 9  1.000  1.000   .983   .929   .811   .425   .115   .013   .000   .000   .000   .000   .000
10  1.000  1.000   .994   .970   .902   .586   .212   .034   .002   .000   .000   .000   .000
11  1.000  1.000   .998   .980   .956   .732   .345   .078   .006   .001   .000   .000   .000
12  1.000  1.000  1.000   .997   .983   .846   .500   .154   .017   .003   .000   .000   .000
13  1.000  1.000  1.000   .999   .994   .922   .655   .268   .044   .020   .002   .000   .000
14  1.000  1.000  1.000  1.000   .998   .966   .788   .414   .098   .030   .006   .000   .000
15  1.000  1.000  1.000  1.000  1.000   .987   .885   .575   .189   .071   .017   .000   .000
16  1.000  1.000  1.000  1.000  1.000   .996   .946   .726   .323   .149   .047   .000   .000
17  1.000  1.000  1.000  1.000  1.000   .999   .978   .846   .488   .273   .109   .002   .000
18  1.000  1.000  1.000  1.000  1.000  1.000   .993   .926   .659   .439   .220   .009   .000
19  1.000  1.000  1.000  1.000  1.000  1.000   .998   .971   .807   .622   .383   .033   .001
20  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .991   .910   .786   .579   .098   .007
21  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .998   .967   .904   .766   .236   .034
22  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .991   .968   .902   .463   .127
23  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .998   .993   .973   .729   .358
24  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000   .999   .996   .928   .723
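
The entries of Table A.1 can be reproduced directly from the definition of the binomial cdf as a sum of pmf values. The following Python sketch is offered only as a check on the table; the function names `binom_pmf` and `binom_cdf` are our own illustrative choices and do not appear elsewhere in the text.

```python
from math import comb


def binom_pmf(y: int, n: int, p: float) -> float:
    """b(y; n, p): probability of exactly y successes in n trials."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)


def binom_cdf(x: int, n: int, p: float) -> float:
    """B(x; n, p): sum of b(y; n, p) for y = 0, 1, ..., x."""
    return sum(binom_pmf(y, n, p) for y in range(x + 1))


if __name__ == "__main__":
    # Print the n = 10, p = 0.30 column for comparison with Table A.1(b).
    for x in range(10):
        print(x, f"{binom_cdf(x, 10, 0.30):.3f}")
```

Rounding each result to three decimals should agree with the tabled values up to the rounding used when the table was compiled.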
A.2 Poisson cdf 


Table A.2 Cumulative Poisson probabilities $P(x; \mu) = \sum_{y=0}^{x} \frac{e^{-\mu}\mu^{y}}{y!}$

                                   μ
 x     .1     .2     .3     .4     .5     .6     .7     .8     .9    1.0
 0   .905   .819   .741   .670   .607   .549   .497   .449   .407   .368
 1   .995   .982   .963   .938   .910   .878   .844   .809   .772   .736
 2  1.000   .999   .996   .992   .986   .977   .966   .953   .937   .920
 3         1.000  1.000   .999   .998   .997   .994   .991   .987   .981
 4                       1.000  1.000  1.000   .999   .999   .998   .996
 5                                            1.000  1.000  1.000   .999
 6                                                                 1.000

                                      μ
 x    2.0    3.0    4.0    5.0    6.0    7.0    8.0    9.0   10.0   15.0   20.0
 0   .135   .050   .018   .007   .002   .001   .000   .000   .000   .000   .000
 1   .406   .199   .092   .040   .017   .007   .003   .001   .000   .000   .000
 2   .677   .423   .238   .125   .062   .030   .014   .006   .003   .000   .000
 3   .857   .647   .433   .265   .151   .082   .042   .021   .010   .000   .000
 4   .947   .815   .629   .440   .285   .173   .100   .055   .029   .001   .000
 5   .983   .916   .785   .616   .446   .301   .191   .116   .067   .003   .000
 6   .995   .966   .889   .762   .606   .450   .313   .207   .130   .008   .000
 7   .999   .988   .949   .867   .744   .599   .453   .324   .220   .018   .001
 8  1.000   .996   .979   .932   .847   .729   .593   .456   .333   .037   .002
 9          .999   .992   .968   .916   .830   .717   .587   .458   .070   .005
10         1.000   .997   .986   .957   .901   .816   .706   .583   .118   .011
11                 .999   .995   .980   .947   .888   .803   .697   .185   .021
12                1.000   .998   .991   .973   .936   .876   .792   .268   .039
13                        .999   .996   .987   .966   .926   .864   .363   .066
14                       1.000   .999   .994   .983   .959   .917   .466   .105
15                               .999   .998   .992   .978   .951   .568   .157
16                              1.000   .999   .996   .989   .973   .664   .221
17                                     1.000   .998   .995   .986   .749   .297
18                                             .999   .998   .993   .819   .381
19                                            1.000   .999   .997   .875   .470
20                                                   1.000   .998   .917   .559
21                                                           .999   .947   .644
22                                                          1.000   .967   .721
23                                                                  .981   .787
24                                                                  .989   .843
25                                                                  .994   .888
26                                                                  .997   .922
27                                                                  .998   .948
28                                                                  .999   .966
29                                                                 1.000   .978
30                                                                         .987
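
Cumulative Poisson probabilities for values of μ not covered by Table A.2 can be obtained by summing the Poisson pmf. The Python sketch below does exactly that; the name `poisson_cdf` is an illustrative choice of ours, and the code is meant as a check on the table rather than as production software.

```python
from math import exp, factorial


def poisson_cdf(x: int, mu: float) -> float:
    """P(x; mu): sum of e^(-mu) * mu^y / y! for y = 0, 1, ..., x."""
    return sum(exp(-mu) * mu**y / factorial(y) for y in range(x + 1))


if __name__ == "__main__":
    # Print the mu = 5.0 column for comparison with Table A.2.
    for x in range(16):
        print(x, f"{poisson_cdf(x, 5.0):.3f}")
```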
A.3 Standard Normal cdf 
Table A.3 Standard normal curve areas $\Phi(z) = P(Z \le z)$

[Figure: standard normal density curve; the shaded area to the left of z equals Φ(z)]

   z    .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
-3.4  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0003  .0002
-3.3  .0005  .0005  .0005  .0004  .0004  .0004  .0004  .0004  .0004  .0003
-3.2  .0007  .0007  .0006  .0006  .0006  .0006  .0006  .0005  .0005  .0005
-3.1  .0010  .0009  .0009  .0009  .0008  .0008  .0008  .0008  .0007  .0007
-3.0  .0013  .0013  .0013  .0012  .0012  .0011  .0011  .0011  .0010  .0010
-2.9  .0019  .0018  .0017  .0017  .0016  .0016  .0015  .0015  .0014  .0014
-2.8  .0026  .0025  .0024  .0023  .0023  .0022  .0021  .0021  .0020  .0019
-2.7  .0035  .0034  .0033  .0032  .0031  .0030  .0029  .0028  .0027  .0026
-2.6  .0047  .0045  .0044  .0043  .0041  .0040  .0039  .0038  .0037  .0036
-2.5  .0062  .0060  .0059  .0057  .0055  .0054  .0052  .0051  .0049  .0048
-2.4  .0082  .0080  .0078  .0075  .0073  .0071  .0069  .0068  .0066  .0064
-2.3  .0107  .0104  .0102  .0099  .0096  .0094  .0091  .0089  .0087  .0084
-2.2  .0139  .0136  .0132  .0129  .0125  .0122  .0119  .0116  .0113  .0110
-2.1  .0179  .0174  .0170  .0166  .0162  .0158  .0154  .0150  .0146  .0143
-2.0  .0228  .0222  .0217  .0212  .0207  .0202  .0197  .0192  .0188  .0183
-1.9  .0287  .0281  .0274  .0268  .0262  .0256  .0250  .0244  .0239  .0233
-1.8  .0359  .0352  .0344  .0336  .0329  .0322  .0314  .0307  .0301  .0294
-1.7  .0446  .0436  .0427  .0418  .0409  .0401  .0392  .0384  .0375  .0367
-1.6  .0548  .0537  .0526  .0516  .0505  .0495  .0485  .0475  .0465  .0455
-1.5  .0668  .0655  .0643  .0630  .0618  .0606  .0594  .0582  .0571  .0559
-1.4  .0808  .0793  .0778  .0764  .0749  .0735  .0722  .0708  .0694  .0681
-1.3  .0968  .0951  .0934  .0918  .0901  .0885  .0869  .0853  .0838  .0823
-1.2  .1151  .1131  .1112  .1093  .1075  .1056  .1038  .1020  .1003  .0985
-1.1  .1357  .1335  .1314  .1292  .1271  .1251  .1230  .1210  .1190  .1170
-1.0  .1587  .1562  .1539  .1515  .1492  .1469  .1446  .1423  .1401  .1379
-0.9  .1841  .1814  .1788  .1762  .1736  .1711  .1685  .1660  .1635  .1611
-0.8  .2119  .2090  .2061  .2033  .2005  .1977  .1949  .1922  .1894  .1867
-0.7  .2420  .2389  .2358  .2327  .2296  .2266  .2236  .2206  .2177  .2148
-0.6  .2743  .2709  .2676  .2643  .2611  .2578  .2546  .2514  .2483  .2451
-0.5  .3085  .3050  .3015  .2981  .2946  .2912  .2877  .2843  .2810  .2776
-0.4  .3446  .3409  .3372  .3336  .3300  .3264  .3228  .3192  .3156  .3121
-0.3  .3821  .3783  .3745  .3707  .3669  .3632  .3594  .3557  .3520  .3482
-0.2  .4207  .4168  .4129  .4090  .4052  .4013  .3974  .3936  .3897  .3859
-0.1  .4602  .4562  .4522  .4483  .4443  .4404  .4364  .4325  .4286  .4247
-0.0  .5000  .4960  .4920  .4880  .4840  .4801  .4761  .4721  .4681  .4641


   z    .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
 0.0  .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
 0.1  .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
 0.2  .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
 0.3  .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
 0.4  .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
 0.5  .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
 0.6  .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
 0.7  .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
 0.8  .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
 0.9  .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
 1.0  .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
 1.1  .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
 1.2  .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
 1.3  .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
 1.4  .9192  .9207  .9222  .9236  .9251  .9265  .9278  .9292  .9306  .9319
 1.5  .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
 1.6  .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
 1.7  .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
 1.8  .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
 1.9  .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
 2.0  .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
 2.1  .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
 2.2  .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
 2.3  .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
 2.4  .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
 2.5  .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
 2.6  .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
 2.7  .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
 2.8  .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
 2.9  .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
 3.0  .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
 3.1  .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
 3.2  .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
 3.3  .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
 3.4  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998
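
Values of Φ(z) between the tabled grid points can be computed from the error function, using the standard identity Φ(z) = ½[1 + erf(z/√2)]. The short Python sketch below relies on that identity; the function name `phi` is our own, and the code is intended only as a check on Table A.3.

```python
from math import erf, sqrt


def phi(z: float) -> float:
    """Standard normal cdf, via the identity Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


if __name__ == "__main__":
    # A few spot checks against Table A.3.
    for z in (-1.0, 0.0, 1.0, 1.96, 2.33):
        print(z, f"{phi(z):.4f}")
```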
A.4 Incomplete Gamma Function 
Table A.4 The incomplete gamma function $G(x; \alpha) = \int_{0}^{x} \frac{1}{\Gamma(\alpha)}\, y^{\alpha-1} e^{-y}\, dy$

                                   α
 x      1      2      3      4      5      6      7      8      9     10
 1   .632   .264   .080   .019   .004   .001   .000   .000   .000   .000
 2   .865   .594   .323   .143   .053   .017   .005   .001   .000   .000
 3   .950   .801   .577   .353   .185   .084   .034   .012   .004   .001
 4   .982   .908   .762   .567   .371   .215   .111   .051   .021   .008
 5   .993   .960   .875   .735   .560   .384   .238   .133   .068   .032
 6   .998   .983   .938   .849   .715   .554   .394   .256   .153   .084
 7   .999   .993   .970   .918   .827   .699   .550   .401   .271   .170
 8  1.000   .997   .986   .958   .900   .809   .687   .547   .407   .283
 9          .999   .994   .979   .945   .884   .793   .676   .544   .413
10         1.000   .997   .990   .971   .933   .870   .780   .667   .542
11                 .999   .995   .985   .962   .921   .857   .768   .659
12                1.000   .998   .992   .980   .954   .911   .845   .758
13                        .999   .996   .989   .974   .946   .900   .834
14                       1.000   .998   .994   .986   .968   .938   .891
15                               .999   .997   .992   .982   .963   .930
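
For the positive integer values of α that appear in Table A.4, the incomplete gamma function has a closed form obtained by repeated integration by parts: G(x; α) = 1 − e^{−x} Σ_{k=0}^{α−1} x^k/k!. The Python sketch below uses that identity and therefore applies only to integer α; the function name `incomplete_gamma_int` is a hypothetical choice of ours.

```python
from math import exp, factorial


def incomplete_gamma_int(x: float, alpha: int) -> float:
    """G(x; alpha) for a positive integer alpha, using
    G(x; alpha) = 1 - exp(-x) * sum_{k=0}^{alpha-1} x^k / k!."""
    if alpha < 1:
        raise ValueError("alpha must be a positive integer")
    partial = sum(x**k / factorial(k) for k in range(alpha))
    return 1.0 - exp(-x) * partial


if __name__ == "__main__":
    # Print the alpha = 5 column for comparison with Table A.4.
    for x in range(1, 16):
        print(x, f"{incomplete_gamma_int(x, 5):.3f}")
```

For non-integer α a numerical integration routine or a library function would be needed instead.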
A.5 Critical Values for t Distributions 


Table A.5 Critical values for t distributions 

[Figure: t density curve centered at 0, with the central area between −t critical value and t critical value shaded]

Central area 

ν 80% 90% 95% 98% 99% 99.8% 99.9% 
1 3.078 6.314 12.706 31.821 63.657 318.31 636.62 
2 1.886 2.920 4.303 6.965 9.925 22.326 31.598 
3 1.638 2.353 3.182 4.541 5.841 10.213 12.924 
4 1.533 2.132 2.776 3.747 4.604 7.173 8.610 
5 1.476 2.015 2.571 3.365 4.032 5.893 6.869 
6 1.440 1.943 2.447 3.143 3.707 5.208 5.959 
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408 
8 1.397 1.860 2.306 2.896 3.355 4.501 5.041 
9 1.383 1.833 2.262 2.821 3.250 4.297 4.781 
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587 
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437 
12 1.356 1.782 2.179 2.681 3.055 3.930 4.318 
13 1.350 1.771 2.160 2.650 3.012 3.852 4.221 
14 1.345 1.761 2.145 2.624 2.977 3.787 4.140 
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073 
16 1.337 1.746 2.120 2.583 2.921 3.686 4.015 
17 1.333 1.740 2.110 2.567 2.898 3.646 3.965 
18 1.330 1.734 2.101 2.552 2.878 3.610 3.922 
19 1.328 1.729 2.093 2.539 2.861 3.579 3.883 
20 1.325 1.725 2.086 2.528 2.845 3.552 3.850 
21 1.323 1.721 2.080 2.518 2.831 3.527 3.819 
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792 
23 1.319 1.714 2.069 2.500 2.807 3.485 3.767 
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745 
25 1.316 1.708 2.060 2.485 2.787 3.450 3.725 
26 1.315 1.706 2.056 2.479 2.779 3.435 3.707 
27 1.314 1.703 2.052 2.473 2.771 3.421 3.690 
28 1.313 1.701 2.048 2.467 2.763 3.408 3.674 
29 1.311 1.699 2.045 2.462 2.756 3.396 3.659 
30 1.310 1.697 2.042 2.457 2.750 3.385 3.646 
32 1.309 1.694 2.037 2.449 2.738 3.365 3.622 
34 1.307 1.691 2.032 2.441 2.728 3.348 3.601 
36 1.306 1.688 2.028 2.434 2.719 3.333 3.582 
38 1.304 1.686 2.024 2.429 2.712 3.319 3.566 
40 1.303 1.684 2.021 2.423 2.704 3.307 3.551 
50 1.299 1.676 2.009 2.403 2.678 3.262 3.496 
60 1.296 1.671 2.000 2.390 2.660 3.232 3.460 
120 1.289 1.658 1.980 2.358 2.617 3.160 3.373 
∞ 1.282 1.645 1.960 2.326 2.576 3.090 3.291 


A.6 Tail Areas of t Distributions 


Table A.6 t curve tail areas 

[Figure: t curve with the area to the right of t shaded]

                       Degrees of Freedom (ν)

  t     1     2     3     4     5     6     7     8     9    10    11    12
0.0  .500  .500  .500  .500  .500  .500  .500  .500  .500  .500  .500  .500
0.1  .468  .465  .463  .463  .462  .462  .462  .461  .461  .461  .461  .461
0.2  .437  .430  .427  .426  .425  .424  .424  .423  .423  .423  .423  .422
0.3  .407  .396  .392  .390  .388  .387  .386  .386  .386  .385  .385  .385
0.4  .379  .364  .358  .355  .353  .352  .351  .350  .349  .349  .348  .348
0.5  .352  .333  .326  .322  .319  .317  .316  .315  .315  .314  .313  .313
0.6  .328  .305  .295  .290  .287  .285  .284  .283  .282  .281  .280  .280
0.7  .306  .278  .267  .261  .258  .255  .253  .252  .251  .250  .249  .249
0.8  .285  .254  .241  .234  .230  .227  .225  .223  .222  .221  .220  .220
0.9  .267  .232  .217  .210  .205  .201  .199  .197  .196  .195  .194  .193
1.0  .250  .211  .196  .187  .182  .178  .175  .173  .172  .170  .169  .169
1.1  .235  .193  .176  .167  .162  .157  .154  .152  .150  .149  .147  .146
1.2  .221  .177  .158  .148  .142  .138  .135  .132  .130  .129  .128  .127
1.3  .209  .162  .142  .132  .125  .121  .117  .115  .113  .111  .110  .109
1.4  .197  .148  .128  .117  .110  .106  .102  .100  .098  .096  .095  .093
1.5  .187  .136  .115  .104  .097  .092  .089  .086  .084  .082  .081  .080
1.6  .178  .125  .104  .092  .085  .080  .077  .074  .072  .070  .069  .068
1.7  .169  .116  .094  .082  .075  .070  .065  .064  .062  .060  .059  .057
1.8  .161  .107  .085  .073  .066  .061  .057  .055  .053  .051  .050  .049
1.9  .154  .099  .077  .065  .058  .053  .050  .047  .045  .043  .042  .041
2.0  .148  .092  .070  .058  .051  .046  .043  .040  .038  .037  .035  .034
2.1  .141  .085  .063  .052  .045  .040  .037  .034  .033  .031  .030  .029
2.2  .136  .079  .058  .046  .040  .035  .032  .029  .028  .026  .025  .024
2.3  .131  .074  .052  .041  .035  .031  .027  .025  .023  .022  .021  .020
2.4  .126  .069  .048  .037  .031  .027  .024  .022  .020  .019  .018  .017
2.5  .121  .065  .044  .033  .027  .023  .020  .018  .017  .016  .015  .014
2.6  .117  .061  .040  .030  .024  .020  .018  .016  .014  .013  .012  .012
2.7  .113  .057  .037  .027  .021  .018  .015  .014  .012  .011  .010  .010
2.8  .109  .054  .034  .024  .019  .016  .013  .012  .010  .009  .009  .008
2.9  .106  .051  .031  .022  .017  .014  .011  .010  .009  .008  .007  .007
3.0  .102  .048  .029  .020  .015  .012  .010  .009  .007  .007  .006  .006
3.1  .099  .045  .027  .018  .013  .011  .009  .007  .006  .006  .005  .005
3.2  .096  .043  .025  .016  .012  .009  .008  .006  .005  .005  .004  .004
3.3  .094  .040  .023  .015  .011  .008  .007  .005  .005  .004  .004  .003
3.4  .091  .038  .021  .014  .010  .007  .006  .005  .004  .003  .003  .003
3.5  .089  .036  .020  .012  .009  .006  .005  .004  .003  .003  .002  .002
3.6  .086  .035  .018  .011  .008  .006  .004  .004  .003  .002  .002  .002
3.7  .084  .033  .017  .010  .007  .005  .004  .003  .002  .002  .002  .002
3.8  .082  .031  .016  .010  .006  .004  .003  .003  .002  .002  .001  .001
3.9  .080  .030  .015  .009  .006  .004  .003  .002  .002  .001  .001  .001
4.0  .078  .029  .014  .008  .005  .004  .003  .002  .002  .001  .001  .001


                       Degrees of Freedom (ν)

  t    13    14    15    16    17    18    19    20    21    22    23    24
0.0  .500  .500  .500  .500  .500  .500  .500  .500  .500  .500  .500  .500
0.1  .461  .461  .461  .461  .461  .461  .461  .461  .461  .461  .461  .461
0.2  .422  .422  .422  .422  .422  .422  .422  .422  .422  .422  .422  .422
0.3  .384  .384  .384  .384  .384  .384  .384  .384  .384  .383  .383  .383
0.4  .348  .347  .347  .347  .347  .347  .347  .347  .347  .347  .346  .346
0.5  .313  .312  .312  .312  .312  .312  .311  .311  .311  .311  .311  .311
0.6  .279  .279  .279  .278  .278  .278  .278  .278  .278  .277  .277  .277
0.7  .248  .247  .247  .247  .247  .246  .246  .246  .246  .246  .245  .245
0.8  .219  .218  .218  .218  .217  .217  .217  .217  .216  .216  .216  .216
0.9  .192  .191  .191  .191  .190  .190  .190  .189  .189  .189  .189  .189
1.0  .168  .167  .167  .166  .166  .165  .165  .165  .164  .164  .164  .164
1.1  .146  .144  .144  .144  .143  .143  .143  .142  .142  .142  .141  .141
1.2  .126  .124  .124  .124  .123  .123  .122  .122  .122  .121  .121  .121
1.3  .108  .107  .107  .106  .105  .105  .105  .104  .104  .104  .103  .103
1.4  .092  .091  .091  .090  .090  .089  .089  .089  .088  .088  .087  .087
1.5  .079  .077  .077  .077  .076  .075  .075  .075  .074  .074  .074  .073
1.6  .067  .065  .065  .065  .064  .064  .063  .063  .062  .062  .062  .061
1.7  .056  .055  .055  .054  .054  .053  .053  .052  .052  .052  .051  .051
1.8  .048  .046  .046  .045  .045  .044  .044  .043  .043  .043  .042  .042
1.9  .040  .038  .038  .038  .037  .037  .036  .036  .036  .035  .035  .035
2.0  .033  .032  .032  .031  .031  .030  .030  .030  .029  .029  .029  .028
2.1  .028  .027  .027  .026  .025  .025  .025  .024  .024  .024  .023  .023
2.2  .023  .022  .022  .021  .021  .021  .020  .020  .020  .019  .019  .019
2.3  .019  .018  .018  .018  .017  .017  .016  .016  .016  .016  .015  .015
2.4  .016  .015  .015  .014  .014  .014  .013  .013  .013  .013  .012  .012
2.5  .013  .012  .012  .012  .011  .011  .011  .011  .010  .010  .010  .010
2.6  .011  .010  .010  .010  .009  .009  .009  .009  .008  .008  .008  .008
2.7  .009  .008  .008  .008  .008  .007  .007  .007  .007  .007  .006  .006
2.8  .008  .007  .007  .006  .006  .006  .006  .006  .005  .005  .005  .005
2.9  .006  .005  .005  .005  .005  .005  .005  .004  .004  .004  .004  .004
3.0  .005  .004  .004  .004  .004  .004  .004  .004  .003  .003  .003  .003
3.1  .004  .004  .004  .003  .003  .003  .003  .003  .003  .003  .003  .002
3.2  .003  .003  .003  .003  .003  .002  .002  .002  .002  .002  .002  .002
3.3  .003  .002  .002  .002  .002  .002  .002  .002  .002  .002  .002  .001
3.4  .002  .002  .002  .002  .002  .002  .002  .001  .001  .001  .001  .001
3.5  .002  .002  .002  .001  .001  .001  .001  .001  .001  .001  .001  .001
3.6  .002  .001  .001  .001  .001  .001  .001  .001  .001  .001  .001  .001
3.7  .001  .001  .001  .001  .001  .001  .001  .001  .001  .001  .001  .001
3.8  .001  .001  .001  .001  .001  .001  .001  .001  .001  .000  .000  .000
3.9  .001  .001  .001  .001  .001  .001  .000  .000  .000  .000  .000  .000
4.0  .001  .001  .001  .001  .000  .000  .000  .000  .000  .000  .000  .000
                       Degrees of Freedom (ν)

  t    25    26    27    28    29    30    35    40    60   120  ∞ (= z)
0.0  .500  .500  .500  .500  .500  .500  .500  .500  .500  .500  .500
0.1  .461  .461  .461  .461  .461  .461  .460  .460  .460  .460  .460
0.2  .422  .422  .421  .421  .421  .421  .421  .421  .421  .421  .421
0.3  .383  .383  .383  .383  .383  .383  .383  .383  .383  .382  .382
0.4  .346  .346  .346  .346  .346  .346  .346  .346  .345  .345  .345
0.5  .311  .311  .311  .310  .310  .310  .310  .310  .309  .309  .309
0.6  .277  .277  .277  .277  .277  .277  .276  .276  .275  .275  .274
0.7  .245  .245  .245  .245  .245  .245  .244  .244  .243  .243  .242
0.8  .216  .215  .215  .215  .215  .215  .215  .214  .213  .213  .212
0.9  .188  .188  .188  .188  .188  .188  .187  .187  .186  .185  .184
1.0  .163  .163  .163  .163  .163  .163  .162  .162  .161  .160  .159
1.1  .141  .141  .141  .140  .140  .140  .139  .139  .138  .137  .136
1.2  .121  .120  .120  .120  .120  .120  .119  .119  .117  .116  .115
1.3  .103  .103  .102  .102  .102  .102  .101  .101  .099  .098  .097
1.4  .087  .087  .086  .086  .086  .086  .085  .085  .083  .082  .081
1.5  .073  .073  .073  .072  .072  .072  .071  .071  .069  .068  .067
1.6  .061  .061  .061  .060  .060  .060  .059  .059  .057  .056  .055
1.7  .051  .051  .050  .050  .050  .050  .049  .048  .047  .046  .045
1.8  .042  .042  .042  .041  .041  .041  .040  .040  .038  .037  .036
1.9  .035  .034  .034  .034  .034  .034  .033  .032  .031  .030  .029
2.0  .028  .028  .028  .028  .027  .027  .027  .026  .025  .024  .023
2.1  .023  .023  .023  .022  .022  .022  .022  .021  .020  .019  .018
2.2  .019  .018  .018  .018  .018  .018  .017  .017  .016  .015  .014
2.3  .015  .015  .015  .015  .014  .014  .014  .013  .012  .012  .011
2.4  .012  .012  .012  .012  .012  .011  .011  .011  .010  .009  .008
2.5  .010  .010  .009  .009  .009  .009  .009  .008  .008  .007  .006
2.6  .008  .008  .007  .007  .007  .007  .007  .007  .006  .005  .005
2.7  .006  .006  .006  .006  .006  .006  .005  .005  .004  .004  .003
2.8  .005  .005  .005  .005  .005  .004  .004  .004  .003  .003  .003
2.9  .004  .004  .004  .004  .004  .003  .003  .003  .003  .002  .002
3.0  .003  .003  .003  .003  .003  .003  .002  .002  .002  .002  .001
3.1  .002  .002  .002  .002  .002  .002  .002  .002  .001  .001  .001
3.2  .002  .002  .002  .002  .002  .002  .001  .001  .001  .001  .001
3.3  .001  .001  .001  .001  .001  .001  .001  .001  .001  .001  .000
3.4  .001  .001  .001  .001  .001  .001  .001  .001  .001  .000  .000
3.5  .001  .001  .001  .001  .001  .001  .001  .001  .000  .000  .000
3.6  .001  .001  .001  .001  .001  .001  .000  .000  .000  .000  .000
3.7  .001  .001  .000  .000  .000  .000  .000  .000  .000  .000  .000
3.8  .000  .000  .000  .000  .000  .000  .000  .000  .000  .000  .000
3.9  .000  .000  .000  .000  .000  .000  .000  .000  .000  .000  .000
4.0  .000  .000  .000  .000  .000  .000  .000  .000  .000  .000  .000
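
Tail areas for combinations of t and ν not listed in Table A.6 can be approximated by integrating the t density numerically. The Python sketch below applies Simpson's rule to the standard t pdf on [0, t] and subtracts the result from .5; the function names and the default number of subintervals are our own assumptions, chosen for illustration rather than efficiency.

```python
from math import gamma, pi, sqrt


def t_pdf(x: float, v: float) -> float:
    """Density of the t distribution with v degrees of freedom."""
    c = gamma((v + 1) / 2) / (sqrt(v * pi) * gamma(v / 2))
    return c * (1 + x * x / v) ** (-(v + 1) / 2)


def t_tail_area(t: float, v: float, steps: int = 2000) -> float:
    """Area under the t curve to the right of t >= 0 (Simpson's rule)."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of subintervals
    h = t / steps
    total = t_pdf(0.0, v) + t_pdf(t, v)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, v)
    return 0.5 - total * h / 3


if __name__ == "__main__":
    # Spot checks against the nu = 10 column of Table A.6.
    for t in (1.0, 2.0, 3.0):
        print(t, f"{t_tail_area(t, 10):.3f}")
```

The gamma-function ratio in `t_pdf` is evaluated directly, which is adequate for the degrees of freedom shown in the table but could overflow for very large ν.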


Appendix B: Background Mathematics 


B.1 Trigonometric Identities 


$\cos(a + b) = \cos(a)\cos(b) - \sin(a)\sin(b)$
$\cos(a - b) = \cos(a)\cos(b) + \sin(a)\sin(b)$
$\sin(a + b) = \sin(a)\cos(b) + \cos(a)\sin(b)$
$\sin(a - b) = \sin(a)\cos(b) - \cos(a)\sin(b)$
$\cos(a)\cos(b) = \tfrac{1}{2}[\cos(a + b) + \cos(a - b)]$
$\sin(a)\sin(b) = \tfrac{1}{2}[\cos(a - b) - \cos(a + b)]$


B.2 Special Engineering Functions 


$u(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases}$

[Figure: graph of the unit step function u(x)]

$\mathrm{rect}(x) = \begin{cases} 1 & |x| < 0.5 \\ 0 & |x| > 0.5 \end{cases}$

[Figure: graph of rect(x) for x from -1 to 1]


M.A. Carlton and J.L. Devore, Probability with Applications in Engineering, Science, and 
Technology, Springer Texts in Statistics, DOI 10.1007/978-1-4939-0395-5, 
© Springer Science+Business Media New York 2014 


$\mathrm{tri}(x) = \begin{cases} 1 - |x| & |x| \le 1 \\ 0 & |x| > 1 \end{cases}$

[Figure: graph of tri(x)]

$\mathrm{sinc}(x) = \begin{cases} \dfrac{\sin(\pi x)}{\pi x} & x \ne 0 \\ 1 & x = 0 \end{cases}$
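
For readers who want to experiment numerically, the rect and sinc functions of this section translate directly into code. The Python sketch below is illustrative only; in particular, the value assigned to rect at the boundary points |x| = 0.5 is an assumption of ours, since it does not affect any integral in which rect appears.

```python
from math import pi, sin


def rect(x: float) -> float:
    """Rectangle function: 1 for |x| < 0.5, 0 for |x| > 0.5.
    The value at |x| = 0.5 is set to 0 here (an assumption)."""
    return 1.0 if abs(x) < 0.5 else 0.0


def sinc(x: float) -> float:
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else sin(pi * x) / (pi * x)


if __name__ == "__main__":
    for x in (-1.0, -0.25, 0.0, 0.25, 1.0):
        print(x, rect(x), round(sinc(x), 4))
```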


B.3 o(h) Notation 
The symbol o(h) denotes any function of h which has the property that 

$$\lim_{h \to 0} \frac{o(h)}{h} = 0$$

Informally, this property says that the value of the function approaches 0 even 
faster than h approaches 0. 

For example, consider the function $f(h) = h^3$. Then $f(h)/h = h^2$, which does 
indeed approach 0 as $h \to 0$. On the other hand, $f(h) = \sqrt{h}$ does not have the o(h) 
property, since $f(h)/h = 1/\sqrt{h}$, which approaches $\infty$ as $h \to 0^{+}$. Likewise, sin(h) 
does not have the o(h) property: from calculus, $\sin(h)/h \to 1$ as $h \to 0$. 

Note that the sum or difference of two functions that have this property also has 
this property: o(h) + o(h) = o(h). The two o(h) functions need not be the same as 
long as they both have the property. Similarly, the product of two such functions 
also has this property: $o(h) \cdot o(h) = o(h)$. 
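
The defining limit can be explored numerically by evaluating f(h)/h for a sequence of shrinking h values. In the Python sketch below, the helper `ratio` and the particular power function h² are our own illustrative choices; the square-root and sine functions are the counterexamples discussed above.

```python
from math import sin, sqrt


def ratio(f, h: float) -> float:
    """Return f(h) / h, the quantity whose limit defines the o(h) property."""
    return f(h) / h


if __name__ == "__main__":
    print("h, h^2/h, sqrt(h)/h, sin(h)/h")
    for k in range(1, 7):
        h = 10.0 ** -k
        print(h, ratio(lambda t: t**2, h), ratio(sqrt, h), ratio(sin, h))
```

The first ratio shrinks toward 0, the second grows without bound, and the third settles near 1, so only the first function is o(h).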


B.4 The Delta Function 


The Dirac delta function, δ(x), also called an impulse or impulse function, is such 
that δ(x) = 0 for x ≠ 0 and 

$$\int_{-\infty}^{\infty} \delta(x)\, dx = 1$$

More generally, an impulse at location $x_0$ with intensity a is $a \cdot \delta(x - x_0)$. An 
impulse is often graphed as an arrow, with the intensity listed in parentheses, as in 
the accompanying figure. The height of the arrow is meaningless; in fact, the 
“height” of an impulse is +∞. 


[Figure: an impulse at location $x_0$, drawn as an arrow with its intensity (a) in parentheses]


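Properties of the delta function: 

Basic integral: $\int_{-\infty}^{\infty} \delta(x)\, dx = 1$, so $\int_{-\infty}^{\infty} a\,\delta(x - x_0)\, dx = a$

Antiderivative: $\int_{-\infty}^{x} \delta(t)\, dt = u(x)$

Rescaling: $\delta(cx) = \dfrac{\delta(x)}{|c|}$ for $c \ne 0$

Sifting: $\int_{-\infty}^{\infty} g(x)\,\delta(x - x_0)\, dx = g(x_0)$

Convolution: $g(x) \star \delta(x - x_0) = g(x - x_0)$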
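
The sifting property can be made plausible numerically by replacing the impulse with a tall, narrow rectangular pulse of unit area. The Python sketch below is such an approximation, not an exact computation: the pulse width `eps`, the midpoint integration rule, and the function name `sift` are all assumptions made for illustration.

```python
from math import cos


def sift(g, x0: float, eps: float = 1e-3, steps: int = 1000) -> float:
    """Approximate the integral of g(x) * delta(x - x0) by using a
    rectangular pulse of width eps and height 1/eps centered at x0."""
    h = eps / steps
    total = 0.0
    for i in range(steps):
        x = x0 - eps / 2 + (i + 0.5) * h  # midpoint of the i-th subinterval
        total += g(x) * (1.0 / eps) * h
    return total


if __name__ == "__main__":
    print(sift(cos, 0.0))                  # close to cos(0) = 1
    print(sift(lambda x: x * x, 2.0))      # close to 2^2 = 4
```

As eps shrinks, the result approaches g(x0) for any g that is continuous at x0.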


B.5 Fourier Transforms 


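The Fourier transform of a function g(t), denoted $\mathcal{F}\{g(t)\}$ or G(f), is defined by 

$$G(f) = \mathcal{F}\{g(t)\} = \int_{-\infty}^{\infty} g(t)\, e^{-j2\pi ft}\, dt$$

where $j = \sqrt{-1}$. The Fourier transform of g(t) exists provided that the integral of 
g(t) is absolutely convergent; i.e., $\int_{-\infty}^{\infty} |g(t)|\, dt < \infty$. 

The inverse Fourier transform of a function G(f), denoted $\mathcal{F}^{-1}\{G(f)\}$ or 
g(t), is defined by 

$$g(t) = \mathcal{F}^{-1}\{G(f)\} = \int_{-\infty}^{\infty} G(f)\, e^{+j2\pi ft}\, df$$

Properties of Fourier transforms: 

Linearity: $\mathcal{F}\{a_1 g_1(t) + a_2 g_2(t)\} = a_1 G_1(f) + a_2 G_2(f)$
Rescaling: $\mathcal{F}\{g(at)\} = \dfrac{1}{|a|}\, G\!\left(\dfrac{f}{a}\right)$
Duality: $\mathcal{F}\{g(t)\} = G(f) \Rightarrow \mathcal{F}\{G(t)\} = g(-f)$
Time shift: $\mathcal{F}\{g(t - t_0)\} = G(f)\, e^{-j2\pi f t_0}$
Frequency shift: $\mathcal{F}\{g(t)\, e^{j2\pi f_0 t}\} = G(f - f_0)$
Time convolution: $\mathcal{F}\{g_1(t) \star g_2(t)\} = G_1(f)\, G_2(f)$
Frequency convolution: $\mathcal{F}\{g_1(t)\, g_2(t)\} = G_1(f) \star G_2(f)$

Fourier transform pairs: 

g(t)                                        G(f)
$1$                                         $\delta(f)$
$u(t)$                                      $\tfrac{1}{2}\delta(f) + \dfrac{1}{j2\pi f}$
$\cos(2\pi f_0 t)$                          $\tfrac{1}{2}[\delta(f - f_0) + \delta(f + f_0)]$
$\sin(2\pi f_0 t)$                          $\dfrac{1}{2j}[\delta(f - f_0) - \delta(f + f_0)]$
$t^k e^{-at} u(t)$, $a > 0$, $k = 0, 1, 2, \ldots$    $\dfrac{k!}{(a + j2\pi f)^{k+1}}$
$e^{-a|t|}$, $a > 0$                        $\dfrac{2a}{a^2 + (2\pi f)^2}$
$e^{-t^2}$                                  $\sqrt{\pi}\, e^{-\pi^2 f^2}$
$\mathrm{rect}(t)$                          $\mathrm{sinc}(f)$
$\mathrm{tri}(t)$                           $\mathrm{sinc}^2(f)$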
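
Any of the transform pairs can be spot-checked by evaluating the defining integral numerically. The Python sketch below does this for the pair rect(t) and sinc(f): because rect(t) equals 1 on (−1/2, 1/2) and 0 elsewhere, the Fourier integral reduces to an integral of the complex exponential over that interval. The midpoint rule, the step count, and the function names are our own illustrative assumptions.

```python
import cmath
from math import pi, sin


def sinc(x: float) -> float:
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else sin(pi * x) / (pi * x)


def fourier_of_rect(f: float, steps: int = 20000) -> complex:
    """Midpoint-rule approximation of the integral of
    exp(-j 2 pi f t) over -1/2 < t < 1/2, where rect(t) = 1."""
    h = 1.0 / steps
    total = 0j
    for i in range(steps):
        t = -0.5 + (i + 0.5) * h
        total += cmath.exp(-2j * pi * f * t) * h
    return total


if __name__ == "__main__":
    for f in (0.0, 0.3, 1.0, 2.5):
        print(f, fourier_of_rect(f).real, sinc(f))
```

The imaginary part of the numerical result is essentially zero, as expected for an even, real-valued g(t).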


B.6 Discrete-Time Fourier Transforms 


The discrete-time Fourier transform (DTFT) of a function g[n] is defined by 

$$G(F) = \sum_{n=-\infty}^{\infty} g[n]\, e^{-j2\pi Fn}$$

The DTFT of g[n] exists provided that g[n] is absolutely summable; i.e., 
$\sum_{n=-\infty}^{\infty} |g[n]| < \infty$. 

The inverse DTFT of a function G(F) is defined by 

$$g[n] = \int_{-1/2}^{1/2} G(F)\, e^{+j2\pi Fn}\, dF$$

Properties of DTFTs: (an arrow indicates application of the DTFT) 

Periodicity: $G(F + m) = G(F)$ for all integers m; i.e., G(F) has period 1 
Linearity: $a_1 g_1[n] + a_2 g_2[n] \to a_1 G_1(F) + a_2 G_2(F)$
Time shift: $g[n - n_0] \to G(F)\, e^{-j2\pi F n_0}$
Frequency shift: $g[n]\, e^{j2\pi F_0 n} \to G(F - F_0)$
Time convolution: $g_1[n] \star g_2[n] \to G_1(F)\, G_2(F)$
Frequency convolution: $g_1[n]\, g_2[n] \to \int_{-1/2}^{1/2} G_1(\phi)\, G_2(F - \phi)\, d\phi$ 
(periodic convolution of $G_1$ and $G_2$) 

DTFT pairs: 

g[n]                          G(F)
$1$                           $\delta(F)$
$\delta[n]$                   $1$
$u[n]$                        $\tfrac{1}{2}\delta(F) + \dfrac{1}{1 - e^{-j2\pi F}}$
$\cos(2\pi F_0 n)$            $\tfrac{1}{2}[\delta(F - F_0) + \delta(F + F_0)]$
$\sin(2\pi F_0 n)$            $\dfrac{1}{2j}[\delta(F - F_0) - \delta(F + F_0)]$
$a^{|n|}$, $|a| < 1$          $\dfrac{1 - a^2}{1 + a^2 - 2a\cos(2\pi F)}$
$a^n u[n]$, $|a| < 1$         $\dfrac{1}{1 - a e^{-j2\pi F}}$
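
Because the DTFT is a sum rather than an integral, a transform pair can be checked by truncating the series. The Python sketch below compares a truncated sum for g[n] = a^{|n|} with the closed-form expression (1 − a²)/(1 + a² − 2a cos(2πF)); the truncation point `terms` and the function names are assumptions made for illustration.

```python
import cmath
from math import cos, pi


def dtft_two_sided_exponential(a: float, F: float, terms: int = 200) -> complex:
    """Truncated DTFT sum of g[n] = a^|n| over n = -terms, ..., terms."""
    return sum(a ** abs(n) * cmath.exp(-2j * pi * F * n)
               for n in range(-terms, terms + 1))


def closed_form(a: float, F: float) -> float:
    """Closed-form DTFT of a^|n| for |a| < 1."""
    return (1 - a * a) / (1 + a * a - 2 * a * cos(2 * pi * F))


if __name__ == "__main__":
    for F in (0.0, 0.1, 0.25, 0.5):
        print(F, dtft_two_sided_exponential(0.5, F).real, closed_form(0.5, F))
```

Evaluating the truncated sum at F and at F + 1 also illustrates the periodicity property.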


Appendix C: Important Probability 
Distributions 


C.1 Discrete Distributions 


For discrete distributions, the specified pmf and cdf are valid on the range of the 
random variable. The cdf and mgf are only provided when simple expressions exist 
for those functions. 


Binomial (n, p)    X ~ Bin(n, p) 
range: {0, 1, ..., n} 
parameters: n, n = 0, 1, 2, ... (number of trials) 
            p, 0 < p < 1 (success probability) 
pmf: $b(x; n, p) = \binom{n}{x} p^{x} (1 - p)^{n - x}$
cdf: $B(x; n, p)$ (see Table A.1) 
mean: $np$
variance: $np(1 - p)$
mgf: $(1 - p + pe^{t})^{n}$


Note: The n = 1 case is called a Bernoulli distribution. 


Geometric (p) 

range: {1, 2, 3, ...} 
parameter: p, 0 < p < 1 (success probability) 
pmf: $p(1 - p)^{x - 1}$
cdf: $1 - (1 - p)^{x}$
mean: $\dfrac{1}{p}$
variance: $\dfrac{1 - p}{p^{2}}$
mgf: $\dfrac{pe^{t}}{1 - (1 - p)e^{t}}$


Note: Other sources define a geometric rv to be the number of failures preceding 
the first success in independent and identical trials. See Sect. 2.6 for details. 


Hypergeometric (n, M, N)    X ~ Hyp(n, M, N) 
range: {max(0, n − N + M), ..., min(n, M)} 
parameters: n, n = 0, 1, ..., N (number of trials) 
            M, M = 0, 1, ..., N (population number of successes) 
            N, N = 1, 2, 3, ... (population size) 
pmf: $h(x; n, M, N) = \dfrac{\binom{M}{x}\binom{N - M}{n - x}}{\binom{N}{n}}$
cdf: $H(x; n, M, N)$
mean: $n \cdot \dfrac{M}{N}$
variance: $n \cdot \dfrac{M}{N}\left(1 - \dfrac{M}{N}\right) \cdot \dfrac{N - n}{N - 1}$


Note: With the understanding that $\binom{a}{b} = 0$ for a < b, the range of the 
hypergeometric distribution can be simplified to {0, ..., n}. 
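
The hypergeometric pmf, mean, and variance formulas above can be checked against one another by direct summation. The Python sketch below is one way to do so; the function name `hyper_pmf` and the parameter values in the example are hypothetical. It relies on the fact that Python's `math.comb(a, b)` returns 0 when b > a, which matches the convention described in the note, so the sum may run over the simplified range {0, ..., n}.

```python
from math import comb


def hyper_pmf(x: int, n: int, M: int, N: int) -> float:
    """h(x; n, M, N) = C(M, x) * C(N - M, n - x) / C(N, n)."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)


if __name__ == "__main__":
    n, M, N = 5, 12, 20  # illustrative values only
    probs = [hyper_pmf(x, n, M, N) for x in range(n + 1)]
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    print(sum(probs))                                          # total probability
    print(mean, n * M / N)                                     # summed vs. formula
    print(var, n * (M / N) * (1 - M / N) * (N - n) / (N - 1))  # summed vs. formula
```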


Negative Binomial (r, p)    X ~ NB(r, p) 
range: {r, r + 1, r + 2, ...} 
parameters: r, r = 1, 2, ... (desired number of successes) 
            p, 0 < p < 1 (success probability) 
pmf: $nb(x; r, p) = \binom{x - 1}{r - 1} p^{r} (1 - p)^{x - r}$
mean: $\dfrac{r}{p}$
variance: $\dfrac{r(1 - p)}{p^{2}}$
mgf: $\left[\dfrac{pe^{t}}{1 - (1 - p)e^{t}}\right]^{r}$


Notes: The r = 1 case corresponds to the geometric distribution. 

Other sources define a negative binomial rv to be the number of failures 
preceding the rth success in independent and identical trials. See Sect. 2.6 for 
details. 
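
The relationship between the negative binomial and geometric distributions, and the mean and variance formulas, can be checked numerically. In the Python sketch below the function name `nb_pmf`, the parameter values, and the truncation point of the infinite sums are our own assumptions; the truncated tail is negligible for the values used.

```python
from math import comb


def nb_pmf(x: int, r: int, p: float) -> float:
    """nb(x; r, p): probability that the r-th success occurs on trial x."""
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)


if __name__ == "__main__":
    r, p = 3, 0.4            # illustrative values only
    xs = range(r, 400)       # truncation of the infinite range {r, r+1, ...}
    mean = sum(x * nb_pmf(x, r, p) for x in xs)
    var = sum((x - mean) ** 2 * nb_pmf(x, r, p) for x in xs)
    print(mean, r / p)                 # summed vs. formula
    print(var, r * (1 - p) / p**2)     # summed vs. formula
    # With r = 1 the pmf reduces to the geometric pmf p(1 - p)^(x - 1).
    print(nb_pmf(4, 1, 0.3), 0.3 * 0.7**3)
```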


Poisson (μ) 

range: {0, 1, 2, ...} 

parameter: μ, μ > 0 (expected number of events) 

pmf: $p(x; \mu) = \dfrac{e^{-\mu}\mu^{x}}{x!}$

cdf: $P(x; \mu)$ (see Table A.2) 

mean: $\mu$

variance: $\mu$

mgf: $e^{\mu(e^{t} - 1)}$


C.2 Continuous Distributions 


For continuous distributions, the specified pdf and cdf are valid on the range of the 
random variable. The cdf and mgf are only provided when simple expressions exist 
for those functions. 


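Beta (α, β, A, B) 

range: [A, B] 

parameters: α, α > 0 (first shape parameter) 
            β, β > 0 (second shape parameter) 
            A, −∞ < A < B (lower bound) 
            B, A < B < ∞ (upper bound) 

pdf: $\dfrac{1}{B - A} \cdot \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \cdot \Gamma(\beta)} \left(\dfrac{x - A}{B - A}\right)^{\alpha - 1} \left(\dfrac{B - x}{B - A}\right)^{\beta - 1}$
mean: $A + (B - A) \cdot \dfrac{\alpha}{\alpha + \beta}$
variance: $\dfrac{(B - A)^{2}\alpha\beta}{(\alpha + \beta)^{2}(\alpha + \beta + 1)}$


Notes: The A = 0, B = 1 case is called the standard beta distribution. 
The α = 1, β = 1 case is the uniform distribution. 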
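
As a consistency check on the beta formulas, setting α = β = 1 should recover the uniform distribution on [A, B]. The Python sketch below evaluates the pdf, mean, and variance expressions for that case; the function names and the interval endpoints are hypothetical choices of ours.

```python
from math import gamma


def beta_pdf(x: float, a: float, b: float, A: float = 0.0, B: float = 1.0) -> float:
    """Four-parameter beta pdf with shape parameters a, b on [A, B]."""
    const = gamma(a + b) / (gamma(a) * gamma(b)) / (B - A)
    return const * ((x - A) / (B - A)) ** (a - 1) * ((B - x) / (B - A)) ** (b - 1)


def beta_mean(a: float, b: float, A: float = 0.0, B: float = 1.0) -> float:
    return A + (B - A) * a / (a + b)


def beta_var(a: float, b: float, A: float = 0.0, B: float = 1.0) -> float:
    return (B - A) ** 2 * a * b / ((a + b) ** 2 * (a + b + 1))


if __name__ == "__main__":
    A, B = 2.0, 5.0  # illustrative interval
    print(beta_pdf(3.0, 1, 1, A, B), 1 / (B - A))      # uniform density
    print(beta_mean(1, 1, A, B), (A + B) / 2)          # uniform mean
    print(beta_var(1, 1, A, B), (B - A) ** 2 / 12)     # uniform variance
```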


Exponential (λ) 

range: (0, ∞) 
parameter: λ, λ > 0 (rate parameter) 
pdf: $\lambda e^{-\lambda x}$
cdf: $1 - e^{-\lambda x}$
mean: $\dfrac{1}{\lambda}$
variance: $\dfrac{1}{\lambda^{2}}$
mgf: $\dfrac{\lambda}{\lambda - t}$, $t < \lambda$


Note: A second parameter γ, called a threshold parameter, can be introduced to 
shift the density curve away from x = 0. In that case, X − γ has an exponential 
distribution. 


Gamma (α, β) 

range: (0, ∞) 
parameters: α, α > 0 (shape parameter) 
            β, β > 0 (scale parameter) 
pdf: $\dfrac{1}{\beta^{\alpha}\Gamma(\alpha)}\, x^{\alpha - 1} e^{-x/\beta}$
cdf: $G\!\left(\dfrac{x}{\beta}; \alpha\right)$ (see Table A.4) 
mean: $\alpha\beta$
variance: $\alpha\beta^{2}$
mgf: $\left(\dfrac{1}{1 - \beta t}\right)^{\alpha}$, $t < 1/\beta$


Notes: The α = 1, β = 1/λ case corresponds to the exponential distribution. 

The β = 1 case is called the standard gamma distribution. 

The α = n (an integer), β = 1/λ case is called the Erlang distribution. 

A third parameter γ, called a threshold parameter, can be introduced to shift the 
density curve away from x = 0. In that case, X − γ has the two-parameter gamma 
distribution described above. 


Lognormal (μ, σ) 

range: (0, ∞) 
parameters: μ, −∞ < μ < ∞ (first shape parameter) 
            σ, σ > 0 (second shape parameter) 
pdf: $\dfrac{1}{\sqrt{2\pi}\,\sigma x}\, e^{-[\ln(x) - \mu]^{2}/(2\sigma^{2})}$
cdf: $\Phi\!\left(\dfrac{\ln(x) - \mu}{\sigma}\right)$
mean: $e^{\mu + \sigma^{2}/2}$
variance: $e^{2\mu + \sigma^{2}} \cdot (e^{\sigma^{2}} - 1)$


Note: A third parameter γ, called a threshold parameter, can be introduced to 
shift the density curve away from x = 0. In that case, X − γ has the two-parameter 
lognormal distribution described above. 
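
The lognormal cdf is expressed through the standard normal cdf, so it can be computed with the error function. The Python sketch below implements the cdf, mean, and variance formulas listed above; the function names are our own, and Φ is obtained from the standard identity Φ(z) = ½[1 + erf(z/√2)].

```python
from math import erf, exp, log, sqrt


def phi(z: float) -> float:
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def lognormal_cdf(x: float, mu: float, sigma: float) -> float:
    """Phi((ln(x) - mu) / sigma) for x > 0."""
    return phi((log(x) - mu) / sigma)


def lognormal_mean(mu: float, sigma: float) -> float:
    return exp(mu + sigma**2 / 2)


def lognormal_var(mu: float, sigma: float) -> float:
    return exp(2 * mu + sigma**2) * (exp(sigma**2) - 1)


if __name__ == "__main__":
    mu, sigma = 1.0, 0.5  # illustrative values only
    print(lognormal_cdf(exp(mu), mu, sigma))  # the median is e^mu, so this is .5
    print(lognormal_mean(mu, sigma), lognormal_var(mu, sigma))
```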


Normal (μ, σ) [or Gaussian (μ, σ)]    X ~ N(μ, σ) 
range: (−∞, ∞) 
parameters: μ, −∞ < μ < ∞ (mean) 
            σ, σ > 0 (standard deviation) 
pdf: $\dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^{2}/(2\sigma^{2})}$
cdf: $\Phi\!\left(\dfrac{x - \mu}{\sigma}\right)$ (see Table A.3) 
mean: $\mu$
variance: $\sigma^{2}$
mgf: $e^{\mu t + \sigma^{2} t^{2}/2}$


Note: The μ = 0, σ = 1 case is called the standard normal or z distribution. 


Uniform (A, B)    X ~ Unif[A, B] 
range: [A, B] 
parameters: A, −∞ < A < B (lower bound) 
            B, A < B < ∞ (upper bound) 
pdf: $\dfrac{1}{B - A}$
cdf: $\dfrac{x - A}{B - A}$
mean: $\dfrac{A + B}{2}$
variance: $\dfrac{(B - A)^{2}}{12}$
mgf: $\dfrac{e^{Bt} - e^{At}}{(B - A)t}$, $t \ne 0$


Note: The A = 0, B = 1 case is called the standard uniform distribution. 


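Weibull (α, β) 
range: (0, ∞) 
parameters: α, α > 0 (shape parameter) 
            β, β > 0 (scale parameter) 
pdf: $\dfrac{\alpha}{\beta^{\alpha}}\, x^{\alpha - 1} e^{-(x/\beta)^{\alpha}}$
cdf: $1 - e^{-(x/\beta)^{\alpha}}$
mean: $\beta \cdot \Gamma\!\left(1 + \dfrac{1}{\alpha}\right)$
variance: $\beta^{2}\left\{\Gamma\!\left(1 + \dfrac{2}{\alpha}\right) - \left[\Gamma\!\left(1 + \dfrac{1}{\alpha}\right)\right]^{2}\right\}$


Note: A third parameter γ, called a threshold parameter, can be introduced to 
shift the density curve away from x = 0. In that case, X − γ has the two-parameter 
Weibull distribution described above. 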
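
The Weibull mean and variance involve the gamma function, which is available in most software. The Python sketch below evaluates the formulas above with `math.gamma`; the function names are illustrative choices of ours. As a sanity check, α = 1 should reproduce the exponential distribution with mean β and variance β².

```python
from math import exp, gamma


def weibull_cdf(x: float, alpha: float, beta: float) -> float:
    """1 - exp(-(x / beta)^alpha) for x > 0."""
    return 1.0 - exp(-((x / beta) ** alpha))


def weibull_mean(alpha: float, beta: float) -> float:
    return beta * gamma(1 + 1 / alpha)


def weibull_var(alpha: float, beta: float) -> float:
    return beta**2 * (gamma(1 + 2 / alpha) - gamma(1 + 1 / alpha) ** 2)


if __name__ == "__main__":
    # alpha = 1 reduces to an exponential distribution with mean beta.
    print(weibull_mean(1, 2.0), weibull_var(1, 2.0))
    # A shape parameter of 2 with scale 3 (illustrative values only).
    print(weibull_mean(2, 3.0), weibull_var(2, 3.0), weibull_cdf(3.0, 2, 3.0))
```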


C.3 Matlab and R Commands 


Table C.1 indicates the template for Matlab and R commands related to the 
“named” probability distributions. In Table C.1, 

x = input to the pmf, pdf, or cdf 

p = left-tail probability (e.g., p = .5 for the median, or .9 for the 90th percentile) 

N = simulation size; i.e., the length of the vector of random numbers 

pars = the set of parameters, in the order prescribed 

name = a text string specifying the particular distribution 


Table C.1 Matlab and R syntax for probability distribution commands 

             Matlab                         R
pmf/pdf      pdf('name',x,pars)             dname(x,pars)
cdf          cdf('name',x,pars)             pname(x,pars)
Quantile     icdf('name',p,pars)            qname(p,pars)
Random #s    random('name',pars,[N,1])      rname(N,pars)


Table C.2 catalogs the names and parameters for a variety of distributions. 
Notice in Table C.1 that Matlab takes the name as a text string between single 
quotes, while R incorporates it into the command name. 


Table C.2 Names and parameter sets for major distributions in Matlab and R 

                        Matlab                 R
Distribution            name     pars          name      pars
Binomial                bin      n, p          binom     n, p
Geometric^a             geo      p             geom      p
Hypergeometric          hyge     N, M, n       hyper     M, N − M, n
Negative binomial^a     nbin     r, p          nbinom    r, p
Poisson                 pois     μ             pois      μ
Beta^b                  beta     α, β          beta      α, β
Exponential             exp      1/λ           exp       λ
Gamma                   gamma    α, β          gamma     α, 1/β
Lognormal               logn     μ, σ          lnorm     μ, σ
Normal                  norm     μ, σ          norm      μ, σ
Uniform                 unif     A, B          unif      A, B
Weibull                 wbl      β, α          weibull   α, β

^a The geometric and negative binomial commands in Matlab and R assume that the random 
variable counts only failures, and not the total number of trials. See Sect. 2.6 or the software 
documentation for details. 

^b The beta distribution commands in Matlab and R assume a standard beta distribution; i.e., with 
A = 0 and B = 1. 
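
The difference between the two counting conventions mentioned in the first footnote amounts to a shift of the argument: x total trials correspond to x − 1 failures for the geometric distribution. The Python sketch below illustrates the shift with two hypothetical functions of our own; it does not call Matlab or R, and readers should consult the software documentation for the exact behavior of the commands in Table C.2.

```python
def geom_pmf_trials(x: int, p: float) -> float:
    """Convention of this book: X = number of trials up to and
    including the first success, so x = 1, 2, 3, ..."""
    return p * (1 - p) ** (x - 1)


def geom_pmf_failures(k: int, p: float) -> float:
    """Convention described in the footnote: the rv counts only the
    failures before the first success, so k = 0, 1, 2, ..."""
    return p * (1 - p) ** k


if __name__ == "__main__":
    p = 0.3  # illustrative value only
    for x in range(1, 6):
        # P(X = x) under the trials convention equals the
        # failures-convention pmf evaluated at x - 1.
        print(x, geom_pmf_trials(x, p), geom_pmf_failures(x - 1, p))
```

The same idea applies to the negative binomial distribution, where x total trials correspond to x − r failures.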


Answers to Odd-Numbered Exercises 


Chapter 1 
1. (a) A ∩ B′ (b) A ∪ B (c) (A ∩ B′) ∪ (B ∩ A′) 
3. (a) S = {1324, 1342, 1423, 1432, 2314, 2341, 2413, 2431, 3124, 3142, 4123, 4132, 3214, 3241, 


13. 
15. 


4213, 4231} 

(b) A= {1324, 1342, 1423, 1432} 

(c) B= {2314, 2341, 2413, 2431, 3214, 3241, 4213, 4231} 

(d) AU B= {1324, 1342, 1423, 1432, 2314, 2341, 2413, 2431, 3214, 3241, 4213, 4231} 
A ∩ B = ∅ 

A’ = {2314, 2341, 2413, 2431, 3124, 3142, 4123, 4132, 3214, 3241, 4213, 4231} 
(a) A= {SSF, SFS, FSS} 

(b) B = {SSS, SSF, SFS, FSS} 

(c) C= {SSS, SSF, SFS} 

(d) C'= {SFF, FSS, FSF, FFS, FFF} 

A ∪ C = {SSS, SSF, SFS, FSS} 

A ∩ C = {SSF, SFS} 

B ∪ C = {SSS, SSF, SFS, FSS} 

B ∩ C = {SSS, SSF, SFS} 

(@)) {HULL We, ia, aaah, ae, ssh, eh, sy, 1S), Bilih, wile, wiley, il, wae, 2), eM 
232233, Selo S35 321 3229323533332, 333) 

(b) {111, 222, 333} 

(@) (2S, EA, Wile), ail, SH, sil} 

(d) {111, 113, 131, 133, 311, 313, 331, 333} 

(a) §= {BBBAAAA, BBABAAA, BBAABAA, BBAAABA, BBAAAAB, BABBAAA, 
BABABAA, BABAABA, BABAAAB, BAABBAA, BAABABA, BAABAAB, BAAABBA, 
BAAABAB, BAAAABB, ABBBAAA, ABBABAA, ABBAABA, ABBAAAB, ABABBAA, 
ABABABA, ABABAAB, ABAABBA, ABAABAB, ABAAABB, AABBBAA, AABBABA, 
AABBAAB, AABABBA, AABABAB, AABAABB, AAABBBA, AAABBAB, AAABABB, 
AAAABBB} 

(b) {AAAABBB, AAABABB, AAABBAB, AABAABB, AABABAB } 

(a) .07 (b) .30 (c) .57 

(a) They are awarded at least one of the first two projects, .36 

(b) They are awarded neither of the first two projects, .64 

(c) They are awarded at least one of the projects, .53 

(d) They are awarded none of the projects, .47 

(e) They are awarded only the third project, .17 

(f) Either they fail to get the first two or they are awarded the third, .75 


(a) .572 (b) .879 

(a) SAS and SPSS are not the only packages 
(b) .7 (c) .8 (d) .2 

(a) .8841 (b) .0435 

(a) .10 (b) .18, .19 (c) .41 (d) .59 (e) .31 (f) .69 
(a) 1/15 (b) 6/15 (c) 14/15 (d) 8/15 

(a) .85 (b) .15 (c) .22 (d) .35 

(a) 1/9 (b) 8/9 (c) 2/9 

(a) 10,000 (b) .9876 (c) .03 (d) .0337 

(a) 336 (b) 593,775 (c) 83,160 (d) .140 (e) .002 
(a) 240 (b) 12 (c) 108 (d) 132 (e) .55, .413 

(a) .0775 (b) .0082 

(a) 8008 (b) 3300 (c) 5236 (d) .4121, .6538 

(a) .2967 (b) .0747 (c) .2637 (d) .042 

(a) 369,600 (b) .00006494 

(a) 1/15 (b) 1/3 (c) 2/3 

P(A|B) > P(B|A)

(a) .50 (b) .0833 (c) .3571 (d) .8333 

(a) .05 (b) .12 (c) .56, .44 (d) .49, .25 (e) .533 (f) .444, .556
.04 

(a) .50 (b) .0455 (c) .682 (d) .0189 

(a) 3/4 (b) 2/3 

(a) .067 (b) .509 

(a) .765 (b) .235 

.087, .652, .261

.00329 

.4657 for airline #1, .2877 for airline #2, .2466 for airline #3 
A» and A; are independent 

.1936, .3816 

1052 

99999969, .226 

9981 

(a) Yes (b) No 

(a) .343 (b) .657 (c) .189 (d) .216 (e) .3525 

(a) P(A) = P(B) = .2, P(A ∩ B) = .039984, A and B are not independent
(b) .04, very little difference

(c) P(A ∩ B) = .0222, not close; P(A ∩ B) is close to P(A)P(B) when the sample size is
very small relative to the population size

(a) Route #1 (b) .216 

(a) 1 − (1 − 1/N)^n

(b) n=3: .4212, 1/2; n=6: .6651, 1; n= 10: .8385, 10/6; the answers are not close 

(c) .1052, 1/9 =.1111; much closer 


(a) Exact answer = .46 (b) se ≈ .005
.8186 (answers will vary) 


105. ≈ .39, ≈ .88 (answers will vary)
107. ≈ .91 (answers will vary)
109. ≈ .02 (answers will vary)
111. (b) ≈ .37 (answers will vary) (c) ≈ 176,000,000 (answers will vary; exact = 176,214,841)
113. (a) ≈ .20 (b) ≈ .56 (answers will vary)
115. (a) ≈ .5177 (b) ≈ .4914 (answers will vary)
117. ≈ .2 (answers will vary)
119. (o)1=4 - P(A) (numerical answers will vary) 
121. (a) 1140 (b) 969 (c) 1020 (d) .85 
123. (a) .0762 (b) .143 
125. (a) .512 (b) .608 (c) .7835 
127. .1074 
129. (a) 10'* (b) 7.3719 x 10°? 
131. (a) .974 (b) .9754 
13}3, | SVAG) 
135. (a) .018 (b) .601 
37h; | gllS6 
139. (a) .0625 (b) .15625 (c) .34375 (d) .014 
141. (a) .12, .88 (b) .18, .38 
143. 1/4 = P(A₁ ∩ A₂ ∩ A₃) ≠ P(A₁) · P(A₂) · P(A₃) = 1/8
145. (a) a₀ = 0, a₅ = 1 (b) a₂ = (1/2)a₁ + (1/2)a₃ (c) aᵢ = i/5 for i = 0, 1, 2, 3, 4, 5
149. (a) .6923 (b) .52 
Chapter 2 
1. x = 0 for FFF; x = 1 for SFF, FSF, and FFS; x = 2 for SSF, SFS, and FSS; x = 3 for SSS
3. Z = average of the two numbers, with possible values 2/2, 3/2, ..., 12/2; W = absolute
value of the difference, with possible values 0, 1, 2, 3, 4, 5
5. No. In Example 2.4, let Y = 1 if at most three batteries are examined and let Y = 0
otherwise. Then Y has only two values
7. (a) {0, 1, 2, ..., 12}; discrete (c) {1, 2, 3, ...}; discrete (e) {0, c, 2c, ..., 10000c} where c
is the royalty per book; discrete (g) {x: m < x < M} where m and M are the minimum and
maximum possible tension; continuous
9. (a) {2, 4, 6, 8, ...}, that is, {2(1), 2(2), 2(3), 2(4), ...}, an infinite sequence; discrete 
Ii. (a) -10(e)-45;, 25 
13. (a) .70 (b) .45 (c) .55 (d) .71 (e) .65 (f) .45 
15. (a) (1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,4), (3,5), (4,5) (b) p(0) = .3, p(1) = .6,
p(2) = .1 (c) F(x) = 0 for x < 0, = .3 for 0 ≤ x < 1, = .9 for 1 ≤ x < 2, and = 1 for x ≥ 2
17. (a) .81 (b) .162 (c) it is A; AUUUA, UAUUA, UUAUA, UUUAA; .00324
19. p(0) = .09, p(1) = .40, p(2) = .32, p(3) =.19 
21. (b) p(x) = .301, .176, .125, .097, .079, .067, .058, .051, .046 for x= 1, 2,...,9 
(c) F(x) = 0 for x < 1, = .301 for 1 ≤ x < 2, = .477 for 2 ≤ x < 3, ..., = .954 for
8 ≤ x < 9, and = 1 for x ≥ 9
(d) .602, .301 
23. (a) .20 (b) .33 (c) .78 (d) .53 
25. (a) p(y) = (1 − p)^y · p for y = 0, 1, 2, 3, ...
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(a) 1234, 1243, 1324, ..., 4321 

(b) p(0) = 9/24, p(1) = 8/24, p(2) = 6/24, p(3) = 0, p(4) = 1/24

(a) 6.45 GB (b) 15.6475 (c) 3.96 GB (d) 15.6475 

4.49, 2.12, .68 

(a) p (b) p(1 − p) (c) p

E[h3(X)] = $4.93, E[h4(X)] = $5.33, so 4 copies is better 

E(X) = (n + 1)/2, E(X²) = (n + 1)(2n + 1)/6, Var(X) = (n² − 1)/12

(b) .61 (c) .47 (d) $2598 (e) $4064 

(a) μ = −$2/38 for both methods (c) single number: σ = $5.76; square: σ = $2.76

E(X − c) = E(X) − c, E(X − μ) = 0

(a) .25, .11, .06, .04, .01 (b) μ = 2.64, σ = 1.54; for k = 2, the probability is .04, and the
bound of .25 is much too conservative; for k = 3, 4, 5, 10, the probability is 0, and the
bounds are again conservative (c) μ = $0, σ = $d, 0 (d) 1/9, same as the Chebyshev
bound (e) there are many, e.g., p(1) = p(−1) = .02 and p(0) = .96

(a) Yes, n= 10, p= 1/6 (b) Yes, n= 40, p = 1/4 (c) No (d) No (e) No (f) Yes, assuming 
the population is very large; n= 15, p = P(a randomly selected apple weighs > 150 g) 
(a) .515 (b) .218 (c) .011 (d) .480 (e) .965 (f) .000 (g) .595 

(a) .354 (b) .115 (c) .918 

(a) 5 (b) 1.94 (c) .017 

(a) .403 (b) .787 (c) .774 

1478 

.407, independence 

(a) .010368 (c) the probability decreases, to .001970 (d) 1500, 259.2 

(a) .017 (b) .811, .425 (c) .006, .902, .586 


When p = .9, the probability is .99 for A and .9963 for B. If p=.5, the probabilities are 
.75 and .6875, respectively 


(a) 20, 16 (b) 70, 21 

(a) p = 0 or 1 (b) p = .5

P(|X − μ| ≥ 2σ) = .042 when p = .5 and = .065 when p = .75, compared to the upper
bound of .25. Using k = 3 in place of k = 2, these probabilities are .002 and .004,
respectively, whereas the upper bound is .11


(a) .932 (b) .065 (c) .068 (d) .492 (e) .251 

(a) .011 (b) .441 (c) .554, .459 (d) .945 

Poisson(5) (a) .492 (b) .133 

Pull, ces) 

(a) 2.9565, .948 (b) .726 

(a) .122, .809, .283 (b) 12, 3.464 (c) .530, .011 

(a) .221 (b) 6,800,000 (c) p(x; 20.106) 

(a) 1/11 — e~) (b) 0=2; .981 (c) 1.26 

(a) .114 (b) .879 (c) .121 (d) Use the binomial distribution with n= 15, p=.10 
(a) h(x; 15, 10, 20) for x=5, ..., 10 (b) .0325 (c) .697 
(a) h(x; 10, 10, 20) (b) .033 (c) h(x; n, n, 2n)

(a) .2817 (b) .7513 (c) .4912, .9123 

(a) nb(x; 2, .5) (b) .188 (c) .688 (d) 2, 4 

nb(x; 6, .5), 6 

nb(x; 5, 6/36), 30, 12.2 




105. (a) 160, 21.9 (b) .6756 

107. (a) .01e^{9t} + .05e^{10t} + .16e^{11t} + .78e^{12t} (b) E(X) = 11.71, SD(X) = 0.605

109. M_X(t) = e^t/(2 − e^t), E(X) = 2, SD(X) = √2

111. Skewness = —2.20 (Ex. 107), +0.54 (Ex. 108), +2.12 (Ex. 109), 0 (Ex. 110) 
113. E(X)=0, Var(X) =2 

115. p(y) = (.25)^{y−1}(.75) for y = 1, 2, 3, ...

117. M_Y(t) = e^{t²/2}, E(Y) = 0, Var(Y) = 1

121. E(X) = 5, Var(X) = 4

123. M_{n−X}(t) = (p + (1 − p)e^t)^n

125. M_Y(t) = p^r[1 − (1 − p)e^t]^{−r}, E(Y) = r(1 − p)/p; Var(Y) = r(1 − p)/p²

129. mean ≈ 0.5968, sd ≈ 0.8548 (answers will vary)

131. ≈ .9090 (answers will vary)

133. (a) μ ≈ 13.5888, σ ≈ 2.9381 (b) ≈ .1562 (answers will vary)

135. mean ≈ 3.4152, variance ≈ 5.97 (answers will vary)

137. (b) 142 tickets

139. (a) ≈ .2291 (b) ≈ $8696 (c) ≈ $7811 (d) ≈ .2342, ≈ $7,767, ≈ $7,571 (answers will vary)
141. (b) probability ≈ .9196, confidence interval = (.9143, .9249) (answers will vary)
143. (b) 3.114, .405, .636 

145. (a) b(x; 15, .75) (b) .686 (c) .313 (d) 11.25, 2.81 (e) .310 

147. (a) .013 (b) 19 (c) .266 (d) Poisson with μ = 500

149. (a) p(x; 2.5) (b) .067 (c) .109

SL, | Msi}, S05) 

153. p(2) = p², p(3) = (1 − p)p², p(4) = (1 − p)p², p(x) = [1 − p(2) − ... − p(x − 3)](1 − p)p²
for x = 5, 6, 7, ...; .99950841

155. (a) .0029 (b) .0767, .9702 

157. (a) .135 (b) .00144 (c) Σ_{x=0}^{∞} [p(x; 2)]^5

159. 3.590 

161. (a) No (b) .0273 

163. (b) .5μ₁ + .5μ₂ (c) .25(μ₁ − μ₂)² + .5(σ₁² + σ₂²) (d) .6 and .4 replace .5 and .5, respectively
IGS, |= 5) 

167. 500p + 750, 100√(p(1 − p))

169. (a) 2.50 (b) 3.1 
Chapter 3 

1. (b) .4625; the same (c) .5, .278125

3. (b) .5 (c) .6875 (d) .6328

5. (a) k = 3/8 (b) .125 (c) .296875 (d) .578125

7. (a) f(x) = 1/4.05 for .20 < x < 4.25 (b) .3086 (c) .4938 (d) 1/4.05

9. (a) .562 (b) .438, .438 (c) .071

11. (a) .25 (b) .1875 (c) .4375 (d) 1.414 h (e) f(x) = x/2 for 0 < x < 2

13. (a) k = 3 (b) F(x) = 1 − 1/x³ for x ≥ 1 and = 0 otherwise (c) .125, .088

15. (a) F(x) = x³/8 for 0 ≤ x ≤ 2, = 0 for x < 0, = 1 for x > 2 (b) .015625 (c) .0137, .0137


(d) 1.817 min 
(a) .597 (b) .369 (c) f(x) = [ln(4) − ln(x)]/4 for 0 < x < 4
(a) 1.333 h (b) .471 h (c) $2 

(a) .8182 ft* (b) .3137 

(a) A + (B − A)p (b) (A + B)/2 (c) (B^{n+1} − A^{n+1})/[(n + 1)(B − A)]
314.79 m* 

248 °F, 3.6 °F 

1/4 min, 1/4 min 

(c) Ur & V/20, or © v/800 (d) ~100z (e) ~80n7 

g(x) = 10x − 5, M_Y(t) = (e^{5t} − e^{−5t})/(10t), Y ~ Unif[−5, 5]
(a) M_X(t) = .15e^{.5t}/(.15 − t), μ = 7.167, variance = 44.444 (b) .15/(.15 − t), μ = 6.67,
variance = 44.444 (c) M_Y(t) = .15/(.15 − t)

(a) .4850 (b) .3413 (c) .4938 (d) .9876 (e) .9147 (f) .9599 (g) .9104 (h) .0791 (i) .0668 
(j) .9876

(a) 1.34 (b) —1.34 (c) .675 (d) —.675 (e) —1.555 

(a) .9664 (b) .2451 (c) .8664 

(a) .4584 (b) 135.8 kph (c) .9265 (d) .3173 (e) .6844 

(a) .9236 (b) .0021 (c) .1336 

.6826 < .9987 = the second machine 

(a) .2514, ~0 (b) ~39.985 ksi 

o=.0510 

(a) .8664 (b) .0124 (c) .2718 

(a) .7938 (b) 5.88 (c) 7.938 (d) .2651 

(a) Φ(1.72) − Φ(.55) (b) Φ(.55) − [1 − Φ(1.72)]

(a) .7286 (b) .8643, .8159 

(a) .9932 (b) .9875 (c) .8064 

(a) .0392 (b) ~1 

(a) .1587 (b) .0013 (c) .999937 (d) .00000029 

(a) 1 (b) 1 (c) .982 (d) .129 

(a) .1481 (b) .0183 

(a) 120 (b) (3/4)√π (c) .371 (d) .735 (e) 0

(a) .424 (b) .567; median < 24 (c) 60 weeks (d) 66 weeks 
η_p = −ln(1 − p)/λ; median = .693/λ

(a) .5488 (b) .3119 (c) 7.667 s (d) 6.667 s 

(a) .8257, .8257, .0636 (b) .6637 (c) 172.727 h 

(a) .9295 (b) .2974 (c) 98.184 ksi 

(a) μ = 9.164, σ = .38525 (b) .8790 (c) .4247 (d) no

η = e^μ = 9547 kg/day/km

(a) 3.96, 1.99 months (b) .0375 (c) .7016 (d) 7.77 months (e) 13.75 months (f) 4.522 
α = β

(b) Γ(α + β) · Γ(m + β)/[Γ(α + β + m) · Γ(β)], β/(α + β)

Yes, since the pattern in the plot is quite linear 

Yes 

Yes 


Plot In(x) versus z percentile. The pattern is somewhat straight, so a lognormal 
distribution is plausible 
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109. It is plausible that strength is normally distributed, because the pattern is reasonably linear 
111. There is substantial curvature in the plot. λ is a scale parameter (as is σ for the normal
family)
113. f_Y(y) = 2/y³ for y > 1
115. f_Y(y) = y e^{−y²/2} for y > 0
117. f_Y(y) = 1/16 for 0 < y < 16
119. f_Y(y) = 1/[π(1 + y²)] for −∞ < y < ∞
121. Y = g(X) = X²/16
123. f_Y(y) = 1/[2√y] for 0 < y < 1
125. f_Y(y) = 1/(4√y) for 0 < y < 1, = 1/(8√y) for 1 < y < 9, = 0 otherwise
129. (a) F(x) = x²/4, F⁻¹(u) = 2√u (c) μ = 1.333, σ = 0.4714, x̄ and s will vary
131. The inverse cdf is F⁻¹(u) = [√(1 + 48u) − 1]/3
133. (a) The inverse cdf is F-'(u) =r: [1 — (1 —u)'”"] (b) E(X) = 16, X will vary 
135. (a) c = 1.5 (c) 15,000 (d) μ = 3/8, x̄ will vary (e) P(M < .1) = .8760 (answers will vary)
137. (a) x = G⁻¹(u) = −ln(1 − u) (b) √(2e/π) ≈ 1.3155 (c) ~13,155
141. (a) 4 (b) .6 (c) F(x) = x/25 for 0 ≤ x ≤ 25, = 0 for x < 0, = 1 for x > 25 (d) 12.5, 7.22
143. (b) F(x) = 1 − 16/(x + 4)² for x > 0, = 0 for x ≤ 0 (c) .247 (d) 4 years (e) 16.67
145. (a) .6568 (b) 41.56 V (c) .3197 
147. (a) .0003 (exact: .00086) (b) .0888 (exact: .0963) 
149. (a) 68.03 dB, 122.09 dB (b) .3204 (c) .7642, because the lognormal distribution is not 
symmetric 
151. (a) F(x) = 1.5(1 − 1/x) for 1 ≤ x ≤ 3, = 0 for x < 1, = 1 for x > 3 (b) .9, .4 (c) 1.648 s
(d) .553 s (e) .267 s 
153. (a) 1.075, 1.075 (b) .0614, .333 (c) 2.476 mm 
155. (b) $95,600, .3300 
157. (b) F(x) = .5e* for x <0, = 1— .5e~** for x >0 (c) .5, .665, .256, .670 
159. (a) k = (α − 1)5^{α−1}, α > 1 (b) F(x) = 1 − (5/x)^{α−1} for x ≥ 5 (c) 5(α − 1)/(α − 2), α > 2
161. (b) .4602, .3636 (c) .5950 (d) 140.178 MPa 
163. (a) Weibull, with a= 2 and i 20 (b) .542 
165. .5062 
171. (a) 710, 84.423, .684 (b) .376 
Chapter 4 
1. (a) .20 (b) .42 (c) .70 (d) px(x) = .16, .34, .50 for x= 0, 1, 2; py(y) =.24, .38, .38 for y=0, 
1, 2; .50 (e) no 
3. (a) .15 (b) .40 (c) .22 (d) .17, .46 (e) pi (x1) = .19, .30, .25, .14, .12 for x; =0, 1, 2, 3, 4 
(f) po(x2) = .19, .30, .28, .23 for x. =0, 1, 2, 3 (g) no 
5. (a) .0305 (b) .1829 (c) .1073
7. (a) .054 (b) .00018
9. (a) .030 (b) .120 (c) .300 (d) .380 (e) no
11. (a) k = 3/380,000 (b) .3024 (c) .3593 (d) f_X(x) = 10kx² + .05 for 20 ≤ x ≤ 30 (e) no
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13. 


15. 
17. 


19. 


All. 
Wh 


2B, 
Tle 
23), 
31. 
33. 
35) 
BM 
43. 
45. 
47. 
49. 
51. 
Sh 
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67. 
69. 
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73. 
Ws 


We 
12) 
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85. 
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89. 




(a) p(x, y) = e^{−μ₁−μ₂} μ₁^x μ₂^y/(x! y!) (b) e^{−μ₁−μ₂}[1 + μ₁ + μ₂] (c) e^{−(μ₁+μ₂)}(μ₁ + μ₂)^m/m!

(a) f(x, y) = e^{−x−y} for x, y ≥ 0 (b) .400 (c) .594 (d) .330

(a) F(y) = (1 − e^{−λy}) + (1 − e^{−λy})² − (1 − e^{−λy})³ for y > 0, f(y) = 4λe^{−2λy} − 3λe^{−3λy} for
y > 0 (b) 2/(3λ)


(a) .25 (b) 1/π (c) 2/π (d) f_X(x) = 2√(r² − x²)/(πr²) for −r ≤ x ≤ r, f_Y(y) = f_X(y), no


1/3 

(a) .11 (b) px(x) =.78, .12, .07, .03 for x=0, 1, 2, 3; py(y) =.77, .14, .09 for y=0, 1, 2 
(c) no (d) 0.35, 0.32 (e) 95.72 

AS 

LP 

25h, or 15 min 

8 

(a) —3.20 (b) —.207 

(a) .238 (b) .51 

(a) Var(h(X, Y)) = ∬ [h(x, y)]² · f(x, y) dA − [∬ h(x, y) · f(x, y) dA]² (b) 13.34

(a) 87,850, 19,100,116 (b) mean yes, variance no (c) .0027 

.2877, .3686 

.0314 

(a) 45 min (b) 68.33 (c) —1 min, 13.67 (d) —5 min, 68.33 

(a) 50, 10.308 (b) .0075 (c) 50 (d) 111.5625 (e) 131.25 

(a) .9616 (b) .0623 

(a) E(Y_i) = i/2, E(W) = n(n + 1)/4 (b) Var(Y_i) = i²/4, Var(W) = n(n + 1)(2n + 1)/24
10:52.76 a.m. 

(a) mean = 0, sd = √2

(a) X ~ Bin(10, 18/38) (b) Y~ Bin(15, 18/38) (c) X + Y~ Bin(25, 18/38) (f) no 

(a) α = 2, β = 1/λ (c) gamma, α = n, β = 1/λ

(a) .5102 (b) .000000117 

(a) x²/2, x⁴/12 (b) f(x, y) = 1/x² for 0 < y < x² < 1 (c) f_Y(y) = 1/√y − 1 for 0 < y < 1
(a) p_X(x) = 1/10 for x = 0, 1, ..., 9; p(y|x) = 1/9 for y = 0, ..., 9 and y ≠ x; p(x, y) = 1/90
for x, y = 0, 1, ..., 9 and y ≠ x (b) 5 − x/9

(a) f_X(x) = 2x, 0 < x < 1 (b) f(y|x) = 1/x, 0 < y < x (c) .6 (d) no (e) x/2 (f) x²/12


(a) p(x, y) = [2!/(x! y! (2 − x − y)!)] (.3)^x (.2)^y (.5)^{2−x−y} (b) X ~ Bin(2, .3), Y ~ Bin(2, .2)

(c) Y | X = x ~ Bin(2 − x, .2/.7) (d) no (e) (4 − 2x)/7 (f) 10(2 − x)/49

(a) x/2, x7/12 (b) f(x, y) = 1/x for0< y<x<1 (c) fy) =—In(y) forO<y<1 

(a) .6x, .24x (b) 60 (c) 60 

176 lbs, 12.68 lbs 

(a) 1 + 4p, 4p(1 − p) (b) $2598, 16,518,196 (c) 2598(1 + 4p),
√(16518196 + 93071200p − 26998416p²) (d) $2598 and $4064 for p = 0; $7794 and
$7504 for p = .5; $12,990 and $9088 for p = 1

(a) 12 cm, .01 cm (b) 12 cm, .005 cm (c) the larger sample 

(a) .9772, .4772 (b) 10 

43.29 h 
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91. .9332 

93. (a) .8357 (b) no 

95. (a) .1894 (b) .1894 (c) 621.5 gallons 
97. (a) .0968 (b) .8882


99. .9616 
101. 1/X 
103. (a) f(y₁, y₂) = (1/(4π)) e^{−(y₁² + y₂²)/4} (b) f_{Y₁}(y₁) = (1/√(4π)) e^{−y₁²/4} (c) yes
105. (a) y\2—y) forO<y< 1 (b) 211 —w) forO<w<1 
107. 4y₃[ln(y₃)]² for 0 < y₃ < 1
111. (a) N(984, 197.45) (b) .1379 (c) 1237 
113. (a) N(158, 8.72) (b) N(170, 8.72) (c) .4090 
115. (a) .8875x+5.2125 (b) 111.5775 (c) 10.563 (d) .0951 
117. (a) 2x—10 (b) 9 (c) 3 (d) .0228 
119. (a) .1410 (b) .1165 
121. (a) R(t) = e^{−t²} (b) .1054 (c) 2t (d) 0.886 thousand hours
123. (a) R(t) = 1 − .125t³ for 0 ≤ t ≤ 2, = 0 for t > 2 (b) 3t²/(8 − t³) (c) undefined


a Stem a( lent 
125. (a) parallel (b) R(t) = 1— (1 —e! ) (c) A(t) = ae 


127. (a 1-A—R 0 — RO) — d — R3())0 — Ra) —  — Rs) — Ro (2))] 
(b) 70h 
129. (a) R(t) =e (-P/PA) for t<p, = eM” for t> B (b) f(t) = a(1 = 5) eat? /[261) 


133. (a) 5y⁴/10⁵ for 0 < y < 10, 8.33 min (b) 6.67 min (c) 5 min (d) 1.409 min

135. (a) .0238 (b) $2,025 

137. nT(i+ 1/0) nT (i+ 2/0) nT(i+ 1/8) : 
(¢-— 1) (+ 1/641) G@- 1) (4+ 2/6 + 1) r 1)'N(u+ 1/04 7 

139. E(VYnd=7 

143. (b) P(X < 1, Y < 1) ≈ .4154 (answers will vary), exact = .42 (c) mean ≈ 0.4866,
sd ≈ 0.6438 (answers will vary)

145. (b) 60,000 (c) 7.0873, 1.0180 (answers will vary) (d) .2080 (answers will vary) 

147. (a) f_X(x) = 12x(1 − x)² for 0 < x < 1, f(y|x) = 2y/(1 − x)² for 0 < y < 1 − x (c) we expect
16/9 candidates per accepted value, rather than 6

149. (a) p_X(100) = .5 and p_X(250) = .5 (b) p(y|100) = .4, .2, .4 for y = 0, 100, 200;
p(y|250) = .1, .3, .6 for y = 0, 100, 200

ISL. (a) NGa, 01), Nua + porfoilla — w1)], 02/1 — p*) 

153. (b) μ̂ = 196.6193 h, standard error = 1.045 h (answers will vary) (c) .9554, .0021
(answers will vary) 


155. f(t)=e"" —e~' for t>0 

157. (a) k = 3/81,250 (b) f_X(x) = k(250x − 10x²) for 0 ≤ x ≤ 20, = k(450x − 30x² + .5x³) for
20 < x ≤ 30; f_Y(y) = f_X(y); not independent (c) .355 (d) 25.969 lb (e) −32.19, −.894
(f) 7.66


159. t=E(X+Y)=1.167 
163. (c) p=1, because p< 1; p=2/3 < 1, because p> 1 
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165. (a) F(b, d) — F(a, d) — F(b, c)+ F(a, c) 
(b) F(10,6) — F(4,6) — F(10,1) + F(4,1); F(b, d) — F(a— 1, d) — F(b, c— 1)+F(a— 1, c— 1) 
(c) At each (x*, y *), F(x *, y *) is the sum of the probabilities at points (x, y) such that x < x* 
and y < y*. The table of F(x, y) values is 


                x
            100    250
     200    .50    1
y    100    .30    .50
     0      .20    .25

(d) F(x, y) = .6x²y + .4xy³, 0 ≤ x ≤ 1, 0 ≤ y ≤ 1; F(x, y) = 0, x ≤ 0;
F(x, y) = 0, y ≤ 0;
F(x, y) = .6x² + .4x, 0 ≤ x ≤ 1, y > 1;
F(x, y) = .6y + .4y³, x > 1, 0 ≤ y ≤ 1;
F(x, y) = 1, x > 1, y > 1
YAS BS ID, LS SYS 1B) = PS 
(e) F(x, y) = 6x?y?,x+y<1,0<x<10<y<1x>0,y>0 
PlGe, 39) = ee = Che eee 2 By" = yp 4G = arte py = 1ae< il,p<i 
F(x,y) =0,x < 0; F(x, y) =0, y < 0; 
F(x,y) = 3x4 -8°+6°,0<x<1ly>1 
F(x,y) = 3y* — 8y3 + 6y*,0<y<1,x>1 
JP, 599)) = ize S> Igy SI 
167. (a) 2x, x (b) 40 (c) 100 
169. Undefined, +0 
2 
Ly asco e=roonne 


173. Not valid for 75th percentile, but valid for 50th percentile; sum of percentiles = (μ₁ + zσ₁)
+ (μ₂ + zσ₂) = μ₁ + μ₂ + z(σ₁ + σ₂), percentile of sums = (μ₁ + μ₂) + z√(σ₁² + σ₂²)
175. (a) 2360, 73.7 (b) .9713 


177. .9686 

179. .9099 

181. .8340 

183. (a) Sim (b) .9999 
: Gre ar OF : 

185. 26, 1.64 


187. (a) g(y₁, yₙ) = n(n − 1)[F(yₙ) − F(y₁)]^{n−2} f(y₁) f(yₙ) for y₁ < yₙ
(b) f(w₁, w₂) = n(n − 1)[F(w₁ + w₂) − F(w₁)]^{n−2} f(w₁) f(w₁ + w₂),
f_{W₂}(w₂) = ∫ n(n − 1)[F(w₁ + w₂) − F(w₁)]^{n−2} f(w₁) f(w₁ + w₂) dw₁
(c) n(n − 1)w₂^{n−2}(1 − w₂) for 0 ≤ w₂ ≤ 1
191. (a) 10/9 (b) 10/8 (c) 1 + Y₂ + ... + Y₁₀, 29.29 boxes (d) 11.2 boxes
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Chapter 5 
1. (a) x̄ = 113.73 (b) x̃ = 113 (c) s = 12.74 (d) .9091 (e) s/x̄ = 11.2
3. (a) x̄ = 1.3481 (b) x̄ = 1.3481 (c) x̄ + 1.28s = 1.7814 (d) .6736
5. 6, = NX = 1,703,000, 6. =7— ND = 1,591,300, 6, — ros 1,601, 438.281 
7. (a) 120.6 (b) 1,206,000 (c) .80 (d) 120
9. (a) X̄; x̄ = 2.11 (b) √(μ/n), .119
11. (b) nA = 1) (©) n°7/(n — 1)°(n — 2) 
13. (b) √(p₁q₁/n₁ + p₂q₂/n₂) (c) with p̂₁ = x₁/n₁ and p̂₂ = x₂/n₂, √(p̂₁q̂₁/n₁ + p̂₂q̂₂/n₂)
(d) −.245 (e) .041
17. (a) ΣX_i²/(2n) (b) 74.505
19. (b) .444
21. (a) p̂ = 2Y/n − .3; .2 (c) (10/7)Y/n − 9/70
a (a) wi wie pve!) ; d =; the MSE does not depend on p (b) when p is 
Vor (eee Aceon 
near .5, the MSE from part (a) is smaller; when p is near 0 or 1, the usual estimator 
has lower MSE 
25. (a) p̂ = x/n = .15 (b) yes (c) .4437
YW, |e Pp 
2D), | jo Salas = alld, yeas 
31. (a) θ̂ = Σx_i²/(2n) = 74.505, yes (b) η̂ = √(−2 ln(.5) θ̂) = 10.163
33. (@) 6 =—— — af (0 »» - — n, subject to 7 > max(x)) 
a 
“SG 
37. (a) 2.228 (b) 2.131 (c) 2.947 (d) 4.604 (e) 2.492 (f) ~2.715 
39. (a) A normal probability plot of these 20 values is quite linear. (b) (23.79, 26.31) (c) yes
41. (a) (57.38, 384.01) (b) narrower 
43. (a) Based on a normal probability plot, it is reasonable to assume the sample observations 
came from a normal distribution. (b) (430.51, 446.08); 440 is plausible, 450 is not 
45. Interval (c) 
47. 26.14 
49. (c) (12.10, 31.70) 
51. (a) yes (b) no (c) no (d) yes (e) no (f) yes 
53. Using Hₐ: μ < 100 results in the welds being believed in conformance unless proved
otherwise, so the burden of proof is on the nonconformance claim
55. (a) reject H₀ (b) reject H₀ (c) don’t reject H₀ (d) reject H₀ (e) don’t reject H₀
57. (a) .040 (b) .018 (c) .130 (d) .653 (e) <.005 (f) ~.000
59. (a) .0778 (b) .1841 (c) .0250 (d) .0066 (e) .5438
61. (a) H₀: μ = 10 versus Hₐ: μ < 10 (b) reject H₀ (c) don’t reject H₀ (d) reject H₀
63. (a) no; no, because n = 49 (b) H₀: μ = 1.0 versus Hₐ: μ < 1.0, z = −5.79, reject H₀, yes
65. H₀: μ = 200 versus Hₐ: μ > 200, t = 1.19 at 11 df, P-value = .128, do not reject H₀
67. H₀: μ = 3 versus Hₐ: μ ≠ 3, t = −1.759, P-value = .082, reject H₀ at α = .10 but not at
α = .05
69. H₀: μ = 360 versus Hₐ: μ > 360, t = 2.24 at 25 df, P-value = .018, reject H₀, yes
H₀: μ = 15 versus Hₐ: μ < 15, z = −6.17, P-value ≈ 0, reject H₀, yes

H₀: σ = .05 versus Hₐ: σ < .05. Type I error: Conclude that the standard deviation
is <.05 mm when it is really equal to .05 mm. Type II error: Conclude that the standard
deviation is .05 mm when it is really <.05


Type I: saying that the plant is not in compliance when in fact it is. Type II: conclude that 
the plant is in compliance when in fact it isn’t 


(.224, .278) 

(.496, .631) 

(2252/75) 

(b) 342 (c) 385 

H₀: p = .15 versus Hₐ: p > .15, z = 0.69, P-value = .2451, fail to reject H₀

(a) H₀: p = .25 versus Hₐ: p < .25, z = −1.01, P-value = .1562, fail to reject H₀; the winery
should switch to screw tops (b) Type I: conclude that less than 25% of all customers find screw
tops acceptable, when the true percentage is 25%. Type II: fail to recognize that less than 25%
of all customers find screw tops acceptable when that’s actually true. Type II

(a) H₀: p = .2 versus Hₐ: p > .2, z = 1.27, P-value = .1020, fail to reject H₀ (b) Type I:
conclude that more than 20% of the population of female workers is obese, when the true
percentage is 20%. Type II: fail to recognize that more than 20% of the population of
female workers is obese when that’s actually true

H₀: p = .1, Hₐ: p > .1, z = 0.74, P-value ≈ .23, fail to reject H₀

H₀: p = .1 versus Hₐ: p > .1, z = 1.33, P-value = .0918, fail to reject H₀; Type II
H₀: p = .25 versus Hₐ: p < .25, z = −6.09, P-value ≈ 0, reject H₀

(a) H₀: p = .2 versus Hₐ: p > .2, z = 0.97, P-value = .166, fail to reject H₀, so no
modification appears necessary (b) .9974

(a) Gamma(9, 5/3) (b) Gamma(145, 5/53) (c) (11.54, 15.99) 

Beta(490, 455), the same posterior distribution found in the example

Gamma(α + Σx_i, 1/(n + 1/β))

Beta(α + x, β + n − x)

nl> kx, = .0436 

No: E(67) = 07/2 

=x; + 2y 

Xx; + 2n 


(a) The pattern of points in a normal probability plot (not shown) is reasonably linear, 
so, yes, normality is plausible. (b) (33.53, 43.79) 


(.1295, .2986) 


(a) A normal probability plot lends support to the assumption that pulmonary 
compliance is normally distributed. (b) (196.88, 222.62) 


(a) (.539, .581) (b) 2401


(a) expected payoff =0 (b) 6 


Xp + 1.964/ x? + x3 — (1.96)? 
oa o6) 

(a) 90.25% (b) at least 90% (c) at least 100(1 — ka)% 

(a) H₀: μ = 2150 versus Hₐ: μ > 2150 (b) t = (x̄ − 2150)/(s/√n) (c) 1.33 (d) .107 (e) fail
to reject H₀

H₀: μ = 29.0 versus Hₐ: μ > 29.0, t = .7742, P-value = .232, fail to reject H₀

H₀: μ = 9.75 versus Hₐ: μ > 9.75, t = 4.75, P-value ≈ 0. The condition is not met.


je 
(a) N(O, 1) (b) provided x,7 +x” > (1.96)? 
131. H₀: μ = 1.75 versus Hₐ: μ ≠ 1.75, t = 1.70, P-value = .102, do not reject H₀; the data
does not contradict prior research
133. H₀: p = .75 versus Hₐ: p < .75, z = −3.28, P-value = .0005, reject H₀
135. (a) H₀: p ≤ .02 versus Hₐ: p > .02; with X ~ Bin(200, .02), P-value = P(X ≥ 17) =
7.5 × 10⁻⁷; reject H₀ here and conclude that the NIST benchmark is not satisfied (b) .2133
137. H₀: μ = 4 versus Hₐ: μ > 4, z = 1.33, P-value = .0918 > .02, fail to reject H₀
Chapter 6 
1. {cooperative, competitive}; with 1 = cooperative and 2 = competitive, p₁₁ = .6, p₁₂ = .4,
p₂₂ = .7, p₂₁ = .3
3. (a) {full, part, broken} (b) with 1 = full, 2 = part, 3 = broken, p₁₁ = .7, p₁₂ = .2, p₁₃ = .1,
p₂₁ = 0, p₂₂ = .6, p₂₃ = .4, p₃₁ = .8, p₃₂ = 0, p₃₃ = .2
5. (a) X₁ = 2 with prob. p and = 0 with prob. 1 − p (b) 0, 2, 4
(Cy PU ei — 2p Ne (5 )e — p)*” for y=0, 1, ...,.x 
7. (a) A son’s social status, given his father’s social status, has the same probability
distribution as his social status conditional on all family history; no 
(b) The probabilities of social status changes (e.g., poor to middle class) are the same in 
every generation; no 
9. (a) no (b) define a state space by pairs; probabilities from each pair into the next state 
i, Vey es il (b) 8210, .5460 (c) .8031, .5006 
13. (a) Willow City: P(S — S) = .988 > .776 (b) .9776, .9685 (c) .9529 
15. (a) E 4| (b) .52 (c) .524 (d) .606 
17. (a) .2740, .7747 (b) .0380 (c) 2.1, 2.2 
19. (a)
.1439 .2790 .2704 .1747 .1320
.2201 .3332 .2522 .1272 .0674
.1481 .2829 .2701 .1719 .1269
.0874 .2129 .2596 .2109 .2292
.0319 .1099 .1893 .2174 .4516
(b) .0730 (c) .1719
21. (a) .0608, .0646, .0658 (b) .0523, .0664, .0709, .0725 (c) they increase to .2710, .1320, 
.0926, .0798 
23 (@)e25n(b) E4372 
25. (a) ee (b) .778 0’s, .222 1’s (c) .7081 0’s, .2919 1’s 
05.95 
27. (a) t=[.80 .20] (b) P(X; =G) =.816, P(X; =S) =.184 (c) .8541 
29. (a) t=[0 1] (b) P(cooperative) = .3, P(competitive) = .7 (c) .39, .61 
31. (a) no (b) yes 
33. (a) (.3681, .2153, .4167) (b) .4167 (c) 2.72 
35), 


a2 @ 2 
(d) 8/15 (e) 5 


ae eee | 
(a) | O ‘ (b) P” has all nonzero entries (c) (8/15, 4/15, 1/5) 
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41. 


45. 


47. 


49. 


Sill, 


53. 


SP 
a), 


61. 
63. 
(a) π₀ = β/(α + β), π₁ = α/(α + β) (b) α = β = 0 ⇒ the chain is constant; α = β = 1 ⇒ the
chain alternates perfectly; α = 0, β = 1 ⇒ the chain is always 0; α = 1, β = 0 ⇒ the chain
is always 1; α = 0, 0 < β < 1 ⇒ the chain eventually gets stuck at 0; 0 < α < 1, β = 0 ⇒ the
chain eventually gets stuck at 1; 0 < α < 1 and β = 1 or α = 1 and 0 < β < 1 ⇒ the chain is
regular, and the answers to (a) still hold


00 [ (1—a)° a(1—a) a(1—a) ar 
o 01 | f-a) (1—a)(1—-Af) ap a(1 — f) 

10 | (1 — a) ap (1-a)(1-A) a(1—f) 

ne er A-B) BLA) (1B) 


"(a+ py’ (at+ py (at By 


0 


75 (b) .4219, .7383, .8965 (c) 4 (d) 1; no 


(a) states 4 and 5 


oy + ime be 4 5 6 7 8 9 10 
PU, <b|}0 46 .7108 .8302 .9089 .9474 .9713 .9837 .9910 .9949 
ot ie ees 4 5 6 y 8 9 10 
P= [0 46 2508 1194 .0787 .0385 0239 .0124 .0073 .0039 
we 3.1457 


(d) 3.2084 (e) .3814, .6186 


sas © © 0 © 
o 
(a)|.5 0 
Ss 
0 O 
(b) PTp < kK) =0 for k= 1, 2, 3; the probabilities for k=4, ..., 15 are .0625, .0938, 
1250, .1563, .1875, .2168, .2451, .2725, .2988, .3242, .3487, .3723 
(c) .2451 
(d) Po =k) =0 for k= 1, 2, 3; the probabilities for k=4, ..., 15 are .0625, .03125, 
.03125, .03125, .03125, .0293, .0283, .0273, .0264, .0254, .0245, .0236; w+ 3.2531, 
o & 3.9897 (e) 30 


μ_coop = 4.44, μ_comp = 3.89; cooperative


0 @ 
5 0], 4 is an absorbing state 
QO s 
0 1 


ooouw 


0 1 0 0 0 0 
il || =p 0 P 0 0 
(a) 2 0 il=ja 0 pO 
3) 0 0 l1-—p 0 p 
4 0 0 0 0 1 
2 2 

(b) for x9 = $1, $2, $3: a 5 : ea sun aes 

2p? — 2p +1 2p? —2p+1° 2p? -2p+1 

P P a= jin eye 


for x9 =$1, $2, $3: 
OPS arrears) Soren 


3.4825 generations 

(c) (2069.0, 2079.8) (d) (.5993, .6185) (answers will vary)

(a) P(X_{n+1} = 10 | X_n = x) = .4, P(X_{n+1} = 2x | X_n = x) = .6 (b) mean ≈ $47.2 billion, sd ≈ $2.07
trillion (c) ($6.53 billion, $87.7 billion) (d) ($618.32 million, $627.90 million); easier

(a) ($5586.60, $5632.3) (b) ($6695.50, $6773.80) (answers will vary) 
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65. 


67. 


69. 


ale 


73. 


Ws 


ie 


1, 


81. 


(b) .9224 (answers will vary) (c) (6.89, 7.11) (answers will vary) 
0 


(b) no (c) x=[! roL L 1) d)6@)9 


SeCUNCUO 

Stat oo 

Soest 
ale 


0 
0 
5 
0 
5 
0 0 0 0 
0 0 


Nie 


(a) (b) all entries of P® are positive 


j=) ed 


0 0 0 1 0 
(c) 1/12, 1/4, 1/4, 1/6, 1/6, 1/12 (d) 1/4 (e) 12 
0 


0 
_ (b) .0566, .1887, .1887, .1887, .1887, .1887 
0 


Sty aS SS 


0 
(c) 17.67 weeks (including the one week of shipping) 
(a) 2 seasons (b) .3613 (c) 15 seasons (d) 6.25 seasons 
(a) p; =[0.3168 0.1812 0.2761 0.1413 0.0846]; 
P2=[0.3035 0.1266 0.2880 0.1643 0.1176]; 
p3=[0.2908 0.0918 0.2770 0.1843 0.1561] 

(b) 35.7 years, 11.9 years, 9.2 years, 4.3 years 

(c) 16.6 years 


pe} 

a 

Ww 
Seagagts So 
Seeottu©S 
noecocor 


0/0 1 0 0 0 0 
1/0 0 959 0  .041 0 
210 0 O 987 013 O 

(a) 3/0 0 0 0 804.196 (c) 3.9055 weeks (d) .8145 
pd}0 0 O 0 1 0 


tbr|0 O O 0 0 1 
(e) payments are always at least 1 week late; most payments are made at the end of 3 weeks 


98 02 97 03 99 Ol 
Oh Se fe be a ie pak i ie al (Woe 


(a) [3259 22,533 19,469 26,066 81,227 16,701 1511 211,486 171,820 56,916] 
(b) [2683 24,119 21,980 27,015 86,100 15,117 1518 223,783 149,277 59,395]; 
(2261 25,213 24,221 27,526 89,397 13,926 1524 233,533 131,752 61,636]; 
—44%, 424%, +46%, +12%, +20%, —26%, +1.3%, +19%, +34%, +13.4% 

(c) [920 23,202 51,593 21,697 78,402 8988 1445 266,505 65,073 93,160] 
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Chapter 7 


1. (a) Continuous-time, continuous-space (b) continuous-time, discrete-space (c) discrete-time,
continuous-space (d) discrete-time, discrete-space

7. (b) No: at time t = .25, x₀(.25) = −cos(π/2) = 0 and x₁(.25) = cos(π/2) = 0
(c) X(0) = −1 with probability .8 and +1 with probability .2; X(.5) = +1 with probability
.8 and −1 with probability .2

9. (a) discrete-space (c) X,,~ Bin(n, 18/38) 

11. (a) 0 (b) 1/2 

C_XX(t, s) = Var(A)cos(ω₀t + θ₀)cos(ω₀s + θ₀), R_XX(t, s) = v₀² + v₀E[A][cos(ω₀t + θ₀) +
cos(ω₀s + θ₀)] + E[A²]cos(ω₀t + θ₀)cos(ω₀s + θ₀)

15. —_(b) M(s) > 0, because covariance > 0 (c) p= e* (d) Gaussian, mean 0, variance 1.73 

19. (a) μ_S(t) + μ_N(t) (b) R_SS(t, s) + μ_S(t)μ_N(s) + μ_N(t)μ_S(s) + R_NN(t, s)
(c) C_SS(t, s) + C_NN(t, s) (d) σ_S²(t) + σ_N²(t)

23. (a) (1/2)sin(ω₀(s − t)) (b) not orthogonal, not uncorrelated, not independent

25. (a) μ_V (b) E[V²] + (A₀²/2)cos(ω₀τ) (c) yes

27. (a) yes (b) no (c) yes (d) yes 


29. no 
31. μ_A + μ_B, C_AA(τ) + C_AB(τ) + C_BA(τ) + C_BB(τ), yes
33. (a) yes, because its autocovariance has periodic components (b) −42 (c) 500cos(100πτ)
+ 8cos(600πτ) + 49 (d) 557 (e) 508
35. yes: both the time average and ensemble average are 0
37. C_XX(τ)/C_XX(0)
2n 
Al. .0062 (b) 75 + 25 sin | — 
(a) (b) 75 + 25 sin (z 


43. (a) 18n/38, 360n/1444, 360 min(m, n)/1444, (360 min(m, n) + 324mn)/1444
(b) −10n/38, 36,000n/1444, 36,000 min(m, n)/1444, (36,000 min(n, m) + 100mn)/1444


(59 — 150) (c) 168[7 — m] (d) no, and it shouldn’t be 


(c) .3141 
1 
nak (a) py (b) q (2Cxxln n| + Cyy|m — n+ 1] + Cyy|m—n — 1]) (c) yes 
(d) (Cxx[0] + Cxx[1])/2 
49. (c) wes (d) ene (e) yes (f) até 
l-a 1-—@ y 


53. (a) .0993 (b) .1353 (c) 2 
55. (a) .0516 (b) 1— ¥ 73.9 & 7°50*/x! (c) .9179 (d) 6 s (e) .8679 
57. KIA 

3 (ts) _ yom 
5, Wen ooh) 

(n — m)! 

Gil, |e“ 
63. fy(y)=24e (1 —e ”) for y>0 
67. (a) .0492 (b) .00255 


VA, pmf: M(t) =0 or 1 with probability 1/2 each for all t, mean=.5, variance = .25, 
Cut@ys 25 24) end OyS D5 DS OO 


73. (a) .0038 (b) .9535 
75. (a) yes (b) .3174 (c) .3174 (d) .4778 


77. (a) E[X(2)] 80 + 20cos (% (1 15)), Var(X()) =.2t (b) .1251 (c) 3372 (d) .1818 
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79. (a) .3078 (b) .1074 
81. (a) .1171 (b) .6376 (c) .0181, .7410 
83. (a) yes (b) E[X(t)] = 0, RXX(t, s) = (N0/2)min(t, s), no
87. (a) 0 = empty, 1 = a person in stage 1, and 2 = a person in stage 2; q0 = λ, q1 = λ1, q2 = λ2;
q02 = q21 = q10 = 0; q01 = λ, q12 = λ1, q20 = λ2 (b) π = (6/11, 2/11, 3/11)
(c) π = (6/11, 3/11, 2/11) (d) π = (1/7, 2/7, 4/7)
89. (a) q0 = λ, q1 = λ1, q2 = λ2; q02 = q10 = 0; q01 = λ, q12 = λ1, q20 = .8λ2, q21 = .2λ2
(b) π = (24/49, 10/49, 15/49) (c) π = (24/49, 15/49, 10/49) (d) π = (2/17, 5/17, 10/17)
(e) 1.25(1/λ1 + 1/λ2)
91. qi = iβ, qi,i+1 = iβ for i = 1, …, N − 1
93. qi,i+1 = λ for i ≥ 0, qi,i−1 = iβ for i ≥ 1, qi = λ + iβ for i ≥ 1
a a a a 
Too = = >To. = vo 10 = oo M1 = a where X = af) + a1Bo + AP + Aofo; 
95. a aBo 
‘a aP, + aBo + aoP; + a0fo 
99. (a) 0 (b) Cxx(t, s) = 1 if floor(t) = floor(s), = 0 otherwise 
1 n 
101. (a) 0, (1/3)cos(@ 7) (b) 0, 324 COS (@xT), yes 
1 n 
103. (a) 0 (b) uy COS (@«T) - Pp; (C) yes 
k=1 
105. (a) Sn denotes the total lifetime of the machine through its use of the first n rotors.
(b) μS[n] = 125n; σS²[n] = 15,625n; CSS[n, m] = 15,625 min(n, m); RSS[n, m] =
15,625[min(n, m) + nm] (c) .5040
107. Yes 
109. (a)e“ (b)e “(+41 (©) 
— e ato 
ll, | CS se 
ato a 
Chapter 8 
1. F{RXX(τ)} = sinc(f), which is not ≥ 0 for all f
vu wf 
3. (a) 2508(f) 4 5 exp ax 108 (b) 240.37 W (c) 593.17 W 
2 
5. (a) 112,838 W (b) 108,839 W (c) Ryy(t) = Sra exp(—10!7”) 
7. (a) N0B (b) N0Bsinc(2Bτ)
2A AP 2Ae 
9. a) Age” 74"! (b) ———. —_ (c) Ao” (d) —“ arctan(2x 
(a) Aate” ™™ (0) a5 pape (© Ae? (@) —Parctan( 2a) 
11. (a) 10011 +e |) = 136.8 W (b) Ses [1 + cos (2xf)] (c) 126.34 W 
1+ (2nf) 5 
13. μW(t) = 0, RWW(τ) = 2RXX(τ) − RXX(τ − d) − RXX(τ + d), SWW(f) = 2SXX(f)[1 − cos(2πfd)]
15. (a) Yes (b) Yes (c) SZZ(f) = SXX(f) + SYY(f)
17. (b) SZZ(f) = SXX(f) ⋆ SYY(f)
19. No, because PY = ∞
21. (a) Sxx(f) = EIA" |Syv f) (b) S¥y(f) = EIA Sy f) — ElAluyy’6(f) (©) Yes; our 


“engineering interpretation” of the elements of a psd are not valid for non-ergodic processes 
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23s 
2B, 


Pale 


2), 


3B) 


35. 
Bio 


Al. 


43. 


45. 
47. 


49. 
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(a) 2400sin(120,000z) (b) 2400 W (c) 40/(40 + j2xf') (d) 32/(1600 + (2nf)2) for If < 60 kHz 
(e) 0.399997 W 


(a) 0 (b) (1 − e^(−j2πf))/(j2πf) (c) (N0/2)sinc²(f) (d) N0/2


(a) 100δ(f) + 50/(1 + (2πf)²) (b) 125 W (c) 1/(4 + j2πf)², 1/(16 + (2πf)²)²
(d) [100δ(f) + 50/(1 + (2πf)²)] · 1/(16 + (2πf)²)² (e) 0.461 W


N 
(a) (No/2)e~ 7" (b) 5 5-5 (©) Nola 


4a? + 

(a) 2N0π²f²rect(f/2B) (b) (N0/(πτ³))[2π²B²τ² sin(2πBτ) + 2πBτcos(2πBτ) − sin(2πBτ)]

(c) 4N0π²B³/3

(a) RXX(τ) − RXX(τ) ⋆ h(τ) − RXX(τ) ⋆ h(−τ) + RXX(τ) ⋆ h(τ) ⋆ h(−τ)

(b) SXX(f)|1 − H(f)|²

(a) 1.17 MW (b) 250,000δ(f) + 60,000[δ(f − 35,000) + δ(f + 35,000)] + 8rect(f/100,000)

(c) same as part (b) (d) 1.17 MW (e) 5000 W (f) 3000 W (g) SNRin = 234, SNRout = 390
(1 − α²)/(1 + α² − 2αcos(2πF))

(1 − e^(−20λ))/(1 + e^(−20λ) − 2e^(−10λ)cos(2πF))
Tw Te 
1-—+—tni(2F 
8 * 4 eg) 


(b) Psinc(k/2) 


= e /2aFM 
(a) Y= (Xn—m4i+.---+Xn)/M (b) V7 


M-—j\k 
: (c) o? | | for ti=0,1,...M—1 
(1 — e7?2F) Me 


and zero otherwise 
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De Morgan’s laws, 5 

Devore, Jay L., xxix, 356, 462, 467, 496, 
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Dirac delta function, 740 

Discrete distributions, 745-746 

Discrete random variable, 179 
probability distributions for, 86-89 
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simulation of, 160-164 
Discrete sequences, 631-633 
Discrete-time Fourier transform (DTFT), 
742-743 
Discrete-time random processes, 629-633 
Discrete-time signal processing 
discrete-time Fourier transform, 715 
random sequences 
and LTI systems, 717-719 
and sampling, 720-721 


E 
Engineering functions, 739-740 
Erlang distribution, 231 
Error(s) 
estimated standard, 437 
in hypothesis testing, 480-484 
simulation of random events, 67 
standard, 124 
of mean, 84, 353 
point estimation, 437 
Type I, 481-484 
Type II, 481-484 
Estimated standard error, 437 
Estimators, 433-436 
Event(s) 
complement, 4 
compound, 3 
definition, 3 
De Morgan’s laws, 5 
dependent, 53 
disjoint, 5 
independent, 53 
intersection, 4 
mutually exclusive, 5 
probability of, 9 
relations from set theory, 4-6 
relative frequency, 12 
simple, 3 
simulation, 62-63 
estimated/standard error, 67 
precision, 67-68 
RNG, 63-67 
union, 4 
Venn diagrams, 6
Expected value, 100-104, 196-200, 307-309 
of function, 104-107 
linearity of, 105-107 
properties, 309-310 
Experiment 
definition, 1 
sample space of, 2-3 
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Exponential distribution, 226-229 
and gamma distribution, 639-642 


F 

Fermat, Pierre de, xix 

Finite population correction factor, 143 

Fisher, R.A., 449 

Fourier transform, 741-742 
discrete-time, 715, 742-743 


G 
Galton, Francis, 378 
Gamma distribution, 229-232 
calculations with software, 233 
Erlang and, 231 
exponential and, 639-642 
MGF, 232-233 
standard, 230 
Gamma function, 229 
incomplete, 231 
Gaussian/normal distribution, 207-208
binomial distribution, approximating, 
218-219 
calculations with software, 217 
and discrete populations, 217 
non-standardized, 211-215 
normal MGF, 215-217 
standard, 208-211 
Gaussian processes, 652-654 
Gaussian white noise, 692 
Geometric distributions, 90 
negative, 143-146 
Geometric random variable, 145 
Gosset, William Sealy, 463 
Gosset’s theorem, 463 


H 
Highpass filter, 705, 706 
Hoaglin, David, 457 
Hypergeometric distribution, 139-143 
Hypothesis testing 
about population mean, 473 
alternative, 474 
errors in, 480-484 
null, 474 
population proportion, 494-496 
power of test, 480-484 


P-values and one-sample t test, 476-480


significance level, 478 
software for, 484-485 
statistical, 473 
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test procedures, 473-475 
about population mean μ, 475-476
test statistic, 476 
Type I error, 481-484 
Type II error, 481-484 


Ideal filters, 705-708 
Impulse function, 740 
Incomplete gamma function, 733 
Independence, 53-54
events, 54-58
mutually, 56
Interval estimate, 460
Inverse CDF method, 265-269
Inverse DTFT, 743
Inverse Fourier transform, 742


J 


Jointly wide-sense stationary, 619 
Joint probability density function, 290-294 
Joint probability distributions, 287 


conditional distributions, 336-338 
and independence, 338-339 
conditional expectation, 336-338 
and variance, 339-341 
correlation, 307-309, 313-316 
vs. causation, 316 
coefficient, 314 
covariance, 307-313 
dependent random variables, 295 
expected values, 307-309 
properties, 309-310 
independent random variables, 294-296
joint PDF, 290-294 
joint PMF, 288-290 
joint probability table, 289 
Law of Large Numbers, 362-364 
Laws of Total Expectation and Variance, 
341-347 
linear combinations, properties, 320-325 
convolution, 325 
moment generating functions, 
327-330 
PDF of sum, 325-327 
theorem, 321 
marginal probability density functions, 
292 
marginal probability mass functions, 289 
multinomial distribution, 298 
multinomial experiment, 298 
order statistics, 396 
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Joint probability distributions (cont.) 
distributions of Yn and Y1, 397-399
ith order statistic distribution, 399-400 
joint distribution of n order statistics, 
401-402 
of more than 2 rvs, 296-300 
reliability (see Reliability) 
simulations methods (see Simulations 
methods) 
transformations of variables, 367-373 
Joint probability mass function, 288-290 


K 
Kahneman, Daniel, xxvii 
Karlin, Samuel, 552, 553 


L 
Law of Large Numbers, 362-364 
Law of Total Probability, 42-43 
Laws of Total Expectation, 341-347 
Laws of Variance, 341-347 
Likelihood function, 451 
Limit theorems 
CLT, 356-360 
applications of, 360-362 
independent and identically 
distributed, 352 
random samples, 352-355 
standard error of mean, 353 
Linear combinations, properties, 320-325 
convolution, 325 
moment generating functions, 327-330 
PDF of sum, 325-327 
theorem, 321 
Linear, time-invariant (LTI) system, 699-700
Butterworth filters, 708
ideal filters, 705-708 
impulse response, 700 
power signal-to-noise ratio, 709 
random sequences and, 717-719 
signal plus noise, 708-711 
statistical properties of, 700-705 
transfer function, 700 
Lognormal distributions, 239-242 
Lowpass filter, 705, 706 
LTI system. See Linear, time-invariant (LTI) 
system 


M 

Marginal probability density functions, 292 
Marginal probability mass functions, 289 
Markov, Andrey A., 521 
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Markov chains, 521 
with absorbing states, 561-563 
canonical form, 573 
Chapman-Kolmogorov Equations,
530-537 
conditional probabilities, 525 
continuous-time, 523 
discrete-space, 523 
discrete-time, 523 
eventual absorption probabilities, 
572-575 
finite-state, 523 
initial distribution, specifying, 542-546 
initial state, 523 
irreducible chains, 557-558 
mean first passage times, 571-572 
mean time to absorption, 566-571 
one-step transition probabilities, 525 
periodic chains, 557-558 
Markov process, 663 
birth and death process, 670 
continuous-time, 663-664 
explicit form of transition matrix, 
673-675 
generator matrix, 671 
infinitesimal parameters, 666 
instantaneous transition rates, 666 
long-run behavior, 670-673 
sojourn times, transition and, 664-670 
time homogeneous, 664 
transition probabilities, 663 
property, 522-527 
regular, 549-551 
simulation, 579-586 
states, 522 
state space, 522 
steady-state distribution and, 553-554 
Steady-State Theorem, 551-552 
time-homogeneous, 523 
time to absorption, 563-566 
transition 
matrix, 530-532 
probabilities, 525, 532-537 
Matlab 
probability distributions in, 750 
probability plots in, 256 
and R commands, xx 
simulation implemented in, 164-165 
Maximum likelihood estimation (MLE), 
448-457 
Mean 
and autocorrelation functions, 605-613 
first passage times, 571-572 
recurrence time, 553 
and variance functions, 605-609 
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Mean square 
sense, 625 
value, 620 
Mean time to absorption (MTTA), 566-571 
Mean value. See Expected value 
Mendel, Gregor, 544 
Minimum variance unbiased estimator 
(MVUE), 440
Moment generating functions (MGF), 152-154, 
201-203 
of common distributions, 157-158 
gamma distributions, 232-233 
normal, 215-217 
obtaining moments from, 155-157 
Moments, 150-152 
from MGF, 155-157 
skewness coefficient, 152 
MTTA. See Mean time to absorption (MTTA) 
Multinomial distribution, 298 
Multinomial experiment, 298 
Multiplication rule, 39-42 
Multivariate normal distribution, 379-380 
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Negative binomial distributions 

alternative definition, 146-147 

and geometric distributions, 143-146 
Normal distributions. See Gaussian/normal distribution
Notch filter. See Bandstop filter 
Nyquist rate, 720 
Nyquist sampling theorem, 720 
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Olofsson, Peter, xxix 

o(h) notation, 740 

Order statistics, 396 
distributions of Yn and Y1, 397-399
ith order statistic distribution, 399-400
joint distribution of the n order statistics,
401-402
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Pascal, Blaise, xix 
PDF. See Probability density function (PDF) 
Peebles, Peyton, 708 
Periodic chains, 557-558 
Permutations, 25-27
PMF. See Probability mass function (PMF) 
Point estimation, 429 
accuracy and precision, 436-440 
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estimated standard error, 437 
estimates and estimators, 433-436 
parameter, 430 
sample mean, 431 
sample median, 431 
sample range, 432 
sample standard deviation, 432 
sample variance, 432 
standard error, 437 
statistic, 431 
unbiased estimator, 437 
Poisson cumulative distribution function, 
730-731 
Poisson distribution, 130-131 
with binomial distributions, 133 
as limit, 132-134 
mean and variance, 134 
Poisson process, 135-136
with software, 136 
Poisson process, 135-136, 636-639 
alternative definition, 644-646 
combining and decomposing, 642-644 
exponential and gamma distributions, 
639-642 
independent increments, 637 
intensity function, 646 
non-homogeneous, 646-647 
rate, 637 
spatial, 646 
stationary increments, 637 
telegraphic process, 647-648 
Population proportion 
confidence intervals, 491-494 
hypothesis testing, 494-496 
score confidence interval, 492 
software for inferences, 496 
Power spectral density (PSD), 683 
average/expected power, 684 
cross-power spectral density, 694 
in frequency band, 690-692 
partitioning, 688-690 
properties, 687-690 
for two processes, 694-696 
white noise processes, 692-694 
Wiener-Khinchin Theorem, 686
Precision, 165-168 
Principle of Unbiased Estimation, 439 
Probability 
Addition Rule, 14-16 
application, xx-xxi
to business, xxi-xxii
to engineering and operations research,
xxii-xxv
to finance, xxv-xxvi
to life sciences, xxii-xxiii
axioms, 9-11 
Complement Rule, 13-14 
conditional, 36-37 
Bayes’ theorem, 44-45 
definition, 37-39 
Law of Total Probability, 42-43 
multiplication rule, 39-42 
counting methods 
combinations, 27-31 
fundamental principle, 22-24
k-tuple, 23
permutations, 25-27
tree diagrams, 24-25
coupon collector problem, xx-xxi
definition, 1 
De Morgan’s laws, 5 
determining systematically, 16 
development of, xix 
events, 3-4, 9 
of eventual absorption, 572-575 
in everyday life, xxvii-xxxii
experiment, 1
game theory, xx
inclusion-exclusion principle, 16
independence, 53-54 
events, 54-58 
mutually, 56 
interpretations, 9-13 
outcomes, 16-17 
properties, 9-11, 13-16 
relations from set theory, 4-6 
sample spaces, 2-3 
simulation of random events, 62-63 
estimated/standard error, 67 
precision, 67-68 
RNG, 63-67 
software in, xx 
transition, 532-537 
vectorization, 65 


Probability density function (PDF)
continuous distribution, percentiles of,
189-191
for continuous variables, 180-185
and cumulative distribution functions,
179-180
joint, 290-294, 406-409
marginal PDF, 292
median of, 189
obtaining f(x) from F(x), 189
symmetric, 191
uniform distribution, 182
using F(x) to compute probabilities,
187-189
Probability distributions 
continuous distributions, 747-749 
cumulative, 91-95 
discrete distributions, 745-746 
for discrete random variables, 86-89 
family of, 90 
geometric distribution, 90 
Matlab and R commands, 749-750 
parameter of, 89-90 
Probability histogram, 89 
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Probability mass function (PMF), 87, 288-290 


joint, 404-406 
marginal PMF, 289 
view of, 95 
Probability plots, 248-251 
beyond normality, 254-255
departures from normality, 251-253 
location and scale parameters, 254 
in Matlab and R, 256 
normal, 250 
sample percentiles, 247-248 
shape parameter, 254 
PSD. See Power spectral density (PSD) 
P-values, 476-480 
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R 
probability distributions in, 750 
probability plots in, 256 
simulation implemented in, 164-165 
Random noise, 597
Random number generator (RNG), 63-67
Random process, 597
autocovariance/autocorrelation functions,
609-613
classification, 601-602
continuous-space process, 601
continuous-time processes, 602
discrete sequences, 631-633
discrete-space process, 601
discrete-time, 602, 629-633
ensemble, 598
independent, 613
joint distribution of, 613
mean and variance functions, 605-609
orthogonal, 613
Poisson process (see Poisson process)
random sequence, 602
regarded as random variables, 602-604
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sample function, 598 
stationary processes, 615-619 
types, 598-601 
uncorrelated, 613 
WSS (see Wide-sense stationary (WSS) 
processes) 
Random variable (RV), 81 
Bernoulli, 83 
binomial distribution, 119-121 
continuous, 84 
definition, 82 
discrete, 84 
transformations of, 259-264 
types, 84-85 
Random walk, 632 
Regular Markov chains, 549-551 
Reliability, 383 
function, 383-385 
hazard functions, 389-393 
mean time to failure, 388-389 
series and parallel designs, 385-388 
simulations methods for, 411-413 
RNG. See Random number generator (RNG) 
Ross, Sheldon, 271, 527 
RV. See Random variable (RV) 
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Sample mean 
definition, 84 
point estimation, 431 
Sample median, 247, 397, 431 
Sample space, 2-3 
Sample standard deviation, 84, 432 
Sample variance, 432 
Sampling 
interval, 720 
random sequences and, 720-721 
rate, 720 
Score confidence interval, 492 
SD. See Standard deviation (SD) 
Set theory, 4-6 
Signal processing 
discrete-time (see Discrete-time signal 
processing) 
LTI systems, random processes and, 
699-700 
ideal filters, 705-708 
signal plus noise, 708-711 
statistical properties of, 700-705 
power spectral density, 683-687 
power in frequency band, 690-692 
for processes, 694-696 
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properties, 687-690 
white noise processes, 692-694 
Simulation 
bivariate normal distribution, 409-411 
of discrete random variables, 160-164 
implemented in R and Matlab, 164-165 
of joint probability distributions/system 
reliability, 404-413 
mean, standard deviation, and precision, 
165-168 
for reliability, 411-413 
standard error of mean, 84 
values from joint PDF, 406-409 
values from joint PMF, 404-406 
Standard deviation (SD), 165-168, 198 
Chebyshev’s inequality, 108-109 
definition, 107 
function, 606 
Standard error, 124 
of mean, 84, 353 
point estimation, 437 
Standard normal CDF, 731-732 
Standard normal random variable, 208-211
Stationary processes, 615-619 
definition, 616 
ergodic processes, 624-626 
Statistical inference 
Bayesian inference (see Bayesian 
inference) 
Bayesian method, 430 
CI (see Confidence interval (CI)) 
hypothesis testing (see Hypothesis testing) 
maximum likelihood estimation, 448-457 
point estimation (see Point estimation) 
population proportion, inferences for 
confidence intervals, 491-494 
hypothesis testing, 494-496 
score confidence interval, 492 
software for inferences, 496 
Steady-state distribution, 552-554 
Steady-state probabilities, 554-557 
Steady-State Theorem, 551-552 
Step function, 92 
Stochastic processes. See Random process 
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properties, 462 
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Transformations of random variable, 259-264 


Transition matrix, 530-532 

Transition probability 
multi-step, 532-537 
one-step, 525 

Tree diagrams, 24-25

Trigonometric identities, 739 

t test, one-sample, 476-480 

Tversky, Amos, xxvii 


U 
Uncorrelated random processes, 613 
Uniform distribution, 182 
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Variance 
Chebyshev’s inequality, 108-109 
conditional expectation and, 339-341 
definition, 107 
functions, 605-609
Laws of Total Expectation and, 341-347 
mean-square value, 110 
properties, 109-111 
shortcut formula, 109-111 

Venn diagram, 6 

Volcker, Paul, xxvi 
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WwW 
Weibull distributions, 236-239 
Weibull, Waloddi, 236 
White noise processes, 692-694 
Wide-sense stationary (WSS) processes, 
615-619 
autocorrelation ergodic, 626 
dc power offset, 623-624 
definition, 617 
ergodic processes, 624-626 
mean ergodic, 625 
mean square 
sense, 625 
value, 620 
properties, 620-624 
time autocorrelation, 626 
time average, 624-626 
Wiener-Khinchin Theorem, 686, 694-696
Wiener process. See Brownian motion 
process 
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Wood, Fred, 253 
WSS processes. See Wide-sense stationary (WSS) processes
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Z 
z interval, one-proportion, 492 
z test, one-proportion, 492 


